<a href="https://colab.research.google.com/github/raulbs7/Machine-Learning-Techniques-Project/blob/master/NLP_Supervised_Project/2_Vectorization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 2. VECTORIZATION

## 2.1 Imports

First, we import the libraries used in this notebook.

In [None]:
from google.colab import files
from google.colab import drive
import pandas as pd
import nltk
import io
import re
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer

from nltk.tokenize import word_tokenize

nltk.download("popular")
nltk.download('vader_lexicon') #sentiment analysis
nltk.download('twython') #twitter

from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Do

In [None]:
drive.mount('/drive')

Drive already mounted at /drive; to attempt to forcibly remount, call drive.mount("/drive", force_remount=True).


## 2.2 Importing dataset

The dataset used in this phase is **processed_tweets.csv**.

In [None]:
def upload_dataframes(index_fields):
  # Open Colab's upload dialog; 'uploaded' maps each file name to its raw bytes.
  uploaded = files.upload()
  for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))
    df = pd.read_csv(io.StringIO(uploaded[fn].decode('utf-8')), index_col = index_fields)
    # Only the first uploaded file is read and returned.
    return df

In [None]:
tweets = upload_dataframes([])

Saving processed_tweets.csv to processed_tweets (1).csv
User uploaded file "processed_tweets.csv" with length 1558482 bytes


In [None]:
print(tweets.shape)
tweets.head()

(24783, 7)


Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,0,3,0,0,3,2,rt woman complain cleaning house man always ta...
1,1,3,0,3,0,1,rt boy dat coldtyga dwn bad cuffin dat hoe 1st...
2,2,3,0,3,0,1,rt dawg rt ever fuck bitch start cry confused ...
3,3,3,0,2,1,1,rt look like tranny
4,4,6,0,6,0,1,rt shit hear might true might faker bitch told...


## 2.3 TFIDF

The first configuration extracted from the tweets is TFIDF. It is similar to CountVectorizer, but instead of raw counts it assigns each feature a weight that expresses its relevance: a term scores higher when it is frequent in a tweet and rare across the rest of the corpus. Features that do not appear in at least 4 tweets are removed (`min_df=4`).

In [None]:
def tfidf_vectorizer(matrix):
  vectorizer = TfidfVectorizer(min_df=4, lowercase=False, stop_words='english')
  return vectorizer.fit_transform(matrix)

In [None]:
tweets_tfidf = tfidf_vectorizer(tweets['tweet'])
tweets_tfidf

<24783x4522 sparse matrix of type '<class 'numpy.float64'>'
	with 146734 stored elements in Compressed Sparse Row format>
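
The effect of `min_df` and of the TFIDF weighting is easier to see on a small scale. The following is a minimal sketch on a made-up four-document corpus (the documents and the lower threshold `min_df=2` are assumptions for illustration, not part of the project data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "cat" occurs in 3 documents, "dog" in 2,
# every other word in only 1.
docs = [
    "cat sat mat",
    "cat ate fish",
    "dog chased cat",
    "dog barked loudly",
]

# min_df=2 keeps only the terms found in at least 2 documents.
vectorizer = TfidfVectorizer(min_df=2, lowercase=False, stop_words="english")
matrix = vectorizer.fit_transform(docs)

kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)    # terms that survived the min_df filter
print(matrix.shape)  # (number of documents, number of kept terms)

# In the third document both terms occur once, but "dog" is rarer in the
# corpus, so it receives the larger TFIDF weight.
dog_weight = matrix[2, vectorizer.vocabulary_["dog"]]
cat_weight = matrix[2, vectorizer.vocabulary_["cat"]]
print(dog_weight > cat_weight)
```

Raising `min_df` shrinks the vocabulary, which is why the real matrix above has only 4522 columns.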

## 2.4 TFIDF with N-grams

This configuration is like the regular TFIDF one, but it also extracts n-grams. An n-gram is a sequence of n words that appear next to each other in a tweet; with `ngram_range=(1,2)` the features are single words and pairs of adjacent words. The selection criterion is the same as in the first configuration: an n-gram must appear in at least 4 tweets.

In [None]:
def tfidf_ngram_vectorizer(matrix):
  vectorizer = TfidfVectorizer(ngram_range=(1,2), min_df=4, lowercase=False, stop_words='english')
  return vectorizer.fit_transform(matrix)

In [None]:
tweets_ngrams = tfidf_ngram_vectorizer(tweets['tweet'])
tweets_ngrams

<24783x8048 sparse matrix of type '<class 'numpy.float64'>'
	with 183087 stored elements in Compressed Sparse Row format>
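
As a small illustration of what `ngram_range=(1,2)` produces, here is a sketch on two made-up sentences. The corpus is hypothetical, and `min_df` and `stop_words` are left out so that every n-gram stays visible:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good morning world", "good morning everyone"]  # hypothetical toy corpus

# ngram_range=(1, 2) extracts single words and pairs of adjacent words.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=False)
vectorizer.fit(docs)

features = sorted(vectorizer.vocabulary_)
unigrams = [f for f in features if " " not in f]
bigrams = [f for f in features if " " in f]
print(unigrams)
print(bigrams)
```

The bigrams are added on top of the unigrams, which explains why this matrix has more columns than the plain TFIDF one.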

## 2.5 TFIDF with N-grams and POS-tagging

Each tweet is tokenized into words and every word is replaced by its part-of-speech (POS) tag. The resulting tag sequences are vectorized in the same way as before, so the features are POS tags and POS-tag n-grams instead of words.


In [None]:
def tokenize_pos(raw):
  tokens = word_tokenize(raw)
  tags = nltk.pos_tag(tokens)
  raw_tags = " ".join(tag for (word, tag) in tags)
  return raw_tags

In [None]:
tokens_tweet = tweets['tweet'].apply(tokenize_pos)
tokens_tweet.head()

0          NNS NN VBP NN NN NN RB VBP NN
1       NN NN NN NN NN JJ NN NN NN CD NN
2        NN NN NN RB VBD JJ NN NN VBD NN
3                            JJ NN IN NN
4    NN VBD JJ MD JJ MD VB NN VBD CD NNS
Name: tweet, dtype: object

In [None]:
tweets_ngrams_pos = tfidf_ngram_vectorizer(tokens_tweet)
tweets_ngrams_pos

<24783x381 sparse matrix of type '<class 'numpy.float64'>'
	with 221389 stored elements in Compressed Sparse Row format>
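
To show what the POS features look like, the sketch below vectorizes three of the tag sequences printed by `tokens_tweet.head()` above. It skips the tagging step (so no NLTK resources are needed) and drops `min_df` and `stop_words`, which is a simplification of the notebook's vectorizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# POS-tag strings copied from the tokens_tweet.head() output above.
tag_docs = [
    "NNS NN VBP NN NN NN RB VBP NN",
    "NN NN NN NN NN JJ NN NN NN CD NN",
    "JJ NN IN NN",
]

# lowercase=False keeps the tags in upper case, as in the notebook.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=False)
tag_matrix = vectorizer.fit_transform(tag_docs)

tag_features = sorted(vectorizer.vocabulary_)
print(tag_features)      # single tags such as "NN" and tag pairs such as "JJ NN"
print(tag_matrix.shape)  # (number of tweets, number of tag features)
```

Because the tag set is small, the vocabulary stays small as well, in line with the 381 columns of the real matrix.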

## 2.6 TFIDF with N-grams, POS-tagging and other features

### 2.6.1 Number of RT's

Another feature is the number of times the token `rt` (retweet marker) appears in a tweet.

In [None]:
def num_rt(raw):
  # Count the tokens that are exactly 'rt' (retweet markers) in the tweet.
  return raw.split().count('rt')

In [None]:
tweets_rt = tweets['tweet'].apply(num_rt)
tweets_rt.head()

0    1
1    1
2    2
3    1
4    1
Name: tweet, dtype: int64

In [None]:
tweets_rt = np.reshape(tweets_rt.to_list(), (len(tweets_rt), 1))
tweets_rt

array([[1],
       [1],
       [2],
       ...,
       [0],
       [0],
       [0]])

In [None]:
tweets_rt = sp.csr_matrix(tweets_rt)
tweets_rt

<24783x1 sparse matrix of type '<class 'numpy.longlong'>'
	with 7130 stored elements in Compressed Sparse Row format>
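
The three steps above (count, reshape into a column, convert to sparse) can be sketched end to end on a few made-up tweets. The sample texts are hypothetical and `to_numpy().reshape(-1, 1)` is used here as an equivalent shortcut for the `np.reshape` call:

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

tweets_demo = pd.Series(["rt hello rt world", "no retweet here", "rt once"])  # hypothetical

# Count exact 'rt' tokens per tweet ("retweet" does not count).
rt_counts = tweets_demo.apply(lambda raw: raw.split().count("rt"))

# Shape the counts as a single column and store it as a sparse matrix.
rt_column = sp.csr_matrix(rt_counts.to_numpy().reshape(-1, 1))

print(rt_column.shape)  # one row per tweet, one column
print(rt_column.nnz)    # number of tweets with at least one 'rt'
```

Only non-zero counts are stored, so the 7130 stored elements reported above correspond to the tweets that contain at least one `rt`.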

### 2.6.2 Sentiment Analysis

The tweets are also scored with a sentiment analyzer (NLTK's VADER), which yields 4 features per tweet: the negative, neutral, positive and compound scores.

In [None]:
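# Build the analyzer once; creating it inside the function reloads the lexicon for every tweet.
sentiment_analyzer = SentimentIntensityAnalyzer()

def sentiment_analysis(raw):
  # Scores are returned in VADER's key order: neg, neu, pos, compound.
  sentiment = sentiment_analyzer.polarity_scores(raw)
  return list(sentiment.values())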

In [None]:
tweets_sentiment = tweets['tweet'].apply(sentiment_analysis)
tweets_sentiment.head()

0     [0.238, 0.762, 0.0, -0.3612]
1     [0.259, 0.741, 0.0, -0.5423]
2      [0.765, 0.235, 0.0, -0.946]
3      [0.0, 0.545, 0.455, 0.3612]
4    [0.43, 0.407, 0.163, -0.6808]
Name: tweet, dtype: object

In [None]:
tweets_sentiment = sp.csr_matrix(tweets_sentiment.to_list())
tweets_sentiment

<24783x4 sparse matrix of type '<class 'numpy.float64'>'
	with 72757 stored elements in Compressed Sparse Row format>
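
The conversion to a sparse matrix can be checked on three of the score lists printed above. This sketch only covers that last step and does not run the analyzer itself:

```python
import scipy.sparse as sp

# Score lists copied from the tweets_sentiment.head() output above,
# in VADER's key order: neg, neu, pos, compound.
scores = [
    [0.238, 0.762, 0.0, -0.3612],
    [0.259, 0.741, 0.0, -0.5423],
    [0.0, 0.545, 0.455, 0.3612],
]

sentiment_matrix = sp.csr_matrix(scores)
print(sentiment_matrix.shape)  # one row per tweet, four score columns
print(sentiment_matrix.nnz)    # scores equal to 0.0 are not stored
```

Scores of exactly 0.0 are dropped by the sparse format, which is why the full matrix stores fewer than 24783 x 4 elements.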

### 2.6.3 Hatred N-gram dictionary

Lastly, the n-grams of the hatred dictionary are compared with the text of each tweet. Each dictionary entry becomes one feature: if the n-gram appears in the tweet, the feature takes the weight associated with it (`prophate`); otherwise it is 0.

In [None]:
hatred_dictionary = upload_dataframes([])

Saving refined_ngram_dict.csv to refined_ngram_dict (1).csv
User uploaded file "refined_ngram_dict.csv" with length 3178 bytes


In [None]:
hatred_dictionary.head()

Unnamed: 0,ngram,prophate
0,allah akbar,0.87
1,blacks,0.583
2,chink,0.467
3,chinks,0.542
4,dykes,0.602


In [None]:
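def hatred_dict_analysis(raw):
  # One weight per dictionary entry: 'prophate' if the n-gram occurs in the tweet, else 0.0.
  # re.search treats the n-gram as a regex, so it also matches inside longer words.
  weights = []
  for ngram, prophate in zip(hatred_dictionary['ngram'], hatred_dictionary['prophate']):
    weights.append(prophate if re.search(ngram, raw) else 0.0)
  return weights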

In [None]:
tweets_hatred_dict = tweets['tweet'].apply(hatred_dict_analysis)
tweets_hatred_dict.head()

0    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
1    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
2    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
3    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
4    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
Name: tweet, dtype: object

In [None]:
tweets_hatred_dict = sp.csr_matrix(tweets_hatred_dict.to_list())
tweets_hatred_dict

<24783x178 sparse matrix of type '<class 'numpy.float64'>'
	with 1367 stored elements in Compressed Sparse Row format>
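
The dictionary lookup above matches n-grams as substrings. A possible variant, shown here only as a sketch and not as the method used in this notebook, escapes each n-gram and requires word boundaries. The dictionary entries below are made-up placeholders with the same column names as `refined_ngram_dict.csv`, and `dict_weights` is a hypothetical helper name:

```python
import re
import pandas as pd

# Hypothetical stand-in for refined_ngram_dict.csv (same columns, made-up entries).
demo_dict = pd.DataFrame({"ngram": ["bad phrase", "awful"], "prophate": [0.8, 0.5]})

def dict_weights(raw, dictionary):
    """Return one weight per dictionary entry: prophate on a match, else 0.0."""
    weights = []
    for ngram, prophate in zip(dictionary["ngram"], dictionary["prophate"]):
        # re.escape matches the n-gram literally; \b avoids hits inside longer words.
        pattern = r"\b" + re.escape(ngram) + r"\b"
        weights.append(prophate if re.search(pattern, raw) else 0.0)
    return weights

first = dict_weights("such a bad phrase", demo_dict)
second = dict_weights("awfully quiet", demo_dict)
print(first)   # the first entry matches as a whole phrase
print(second)  # "awful" inside "awfully" is not counted
```

Switching to whole-word matching would change the number of stored elements reported above, so it is a design choice rather than a drop-in replacement.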

## 2.7 Unify all configurations

The following configurations are combined:


*   TFIDF with n-grams
*   TFIDF with n-grams of POS tagging
*   Number of RT's
*   Sentiments of tweets
*   Hatred words

Plain TFIDF (without n-grams) is left out because its single-word features are already included in the vectorization with n-grams.



In [None]:
configurations = sp.hstack((tweets_ngrams, tweets_ngrams_pos,
                            tweets_rt, tweets_sentiment, tweets_hatred_dict), format='csr')
configurations

<24783x8612 sparse matrix of type '<class 'numpy.float64'>'
	with 485730 stored elements in Compressed Sparse Row format>

After concatenating all the configuration matrices, the columns whose entries are all zero are removed, since they carry no information.

In [None]:
correct_columns = np.where(configurations.sum(axis=0)!=0)[1]
configurations = configurations[:,correct_columns]
configurations

<24783x8494 sparse matrix of type '<class 'numpy.float64'>'
	with 485730 stored elements in Compressed Sparse Row format>
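
The concatenation and the column clean-up can be reproduced on two tiny blocks. This is a sketch with made-up values; it uses `getnnz(axis=0)` to count stored entries per column, an alternative to the `sum(axis=0) != 0` test that would also keep a column whose positive and negative values happen to cancel out:

```python
import numpy as np
import scipy.sparse as sp

# Two toy feature blocks with the same number of rows (hypothetical values).
block_a = sp.csr_matrix(np.array([[1.0, 0.0], [0.5, 0.0], [0.0, 0.0]]))
block_b = sp.csr_matrix(np.array([[0.0, 2.0], [0.0, 0.0], [0.0, 3.0]]))

# Place the blocks side by side: rows stay aligned, columns are appended.
stacked = sp.hstack((block_a, block_b), format="csr")

# Keep only the columns with at least one stored non-zero entry.
nonzero_per_column = stacked.getnnz(axis=0)
kept = stacked[:, np.flatnonzero(nonzero_per_column)]

print(stacked.shape, kept.shape)
print(stacked.nnz == kept.nnz)  # dropping empty columns loses no stored values
```

The same pattern is visible in the real output: the column count drops from 8612 to 8494 while the number of stored elements stays at 485730.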

Finally, the CSR matrix is saved to a file with the **.npz** extension, which is used later to select the features.

In [None]:
sp.save_npz('/drive/My Drive/vectorized_tweets.npz', configurations)
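
For reference, a saved matrix can be read back with `sp.load_npz`. The sketch below does a round trip on a toy matrix in a temporary folder; the path and file name are assumptions for illustration, whereas the notebook writes to Google Drive:

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp

matrix = sp.csr_matrix(np.array([[0.0, 1.5], [2.0, 0.0]]))  # toy matrix

# Save to a temporary path (hypothetical location and file name).
path = os.path.join(tempfile.mkdtemp(), "vectorized_demo.npz")
sp.save_npz(path, matrix)

# load_npz restores the matrix in the sparse format it was saved in.
restored = sp.load_npz(path)
same = (restored != matrix).nnz == 0
print(restored.shape, same)
```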