# DSCI 614: Project 4

### Symphony Hopkins

## Introduction

We are acting as a data scientist for a political consulting firm. We were given a dataset in Twitter_Data.csv, which has the following two columns:
+ clean_text: Tweets extracted from Twitter, mainly focused on tweets about Modi (a 2019 Indian Prime Minister candidate) and other Prime Ministerial candidates.
+ category: The actual sentiment of the respective tweet, with three possible values: -1, 0, and 1.

We were asked to perform the following steps:

## 1. Load the dataset of Twitter_Data.csv into memory.

Let's load the dataset into memory using the pandas library.

In [1]:
#importing library
import pandas as pd

In [2]:
# retrieving data from csv file and storing it into a dataframe
twitter_data = pd.read_csv('/Users/symphonyhopkins/Documents/Maryville_University/DSCI_614/Week_4/Twitter_Data.csv')
twitter_data.head()

| | clean_text | category |
|---|---|---|
| 0 | when modi promised “minimum government maximum... | -1.0 |
| 1 | talk all the nonsense and continue all the dra... | 0.0 |
| 2 | what did just say vote for modi  welcome bjp t... | 1.0 |
| 3 | asking his supporters prefix chowkidar their n... | 1.0 |
| 4 | answer who among these the most powerful world... | 1.0 |


Let's look at the shape of the dataframe.

In [3]:
# displaying shape
twitter_data.shape

(162980, 2)

As we can see, the dataset is large, with 162,980 rows and 2 columns. We will also check for missing values. If there are any, we need to address them, because they would otherwise cause errors later on.

In [4]:
# checking for missing values
twitter_data.isna().sum()

clean_text    4
category      7
dtype: int64

Missing values can be handled in several ways. In this case, we will simply drop the rows that contain them, since they affect at most 11 of the 162,980 rows.

In [5]:
# dropping rows with missing values
twitter_data = twitter_data.dropna()

# displaying new shape
twitter_data.shape

(162969, 2)
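
The same check-then-drop pattern can be tried on a small frame first. The sketch below is only an illustration: the `toy` DataFrame and its contents are hypothetical stand-ins for the two columns of Twitter_Data.csv, not rows from the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the two columns of Twitter_Data.csv
toy = pd.DataFrame({
    "clean_text": ["modi wins", np.nan, "vote today", "no comment"],
    "category": [1.0, 0.0, np.nan, -1.0],
})

# counting missing values per column
missing_per_column = toy.isna().sum()

# dropping every row that has at least one missing value
toy_clean = toy.dropna()

print(missing_per_column)
print(toy_clean.shape)
```

Note that `dropna()` keeps the original index labels, so the index has gaps afterwards. If positional row numbers matter later, `reset_index(drop=True)` renumbers the rows.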

## 2. Convert the clean_text column to a matrix of token counts using CountVectorizer with unigrams and bigrams.

In order to perform text feature extraction, we need to create numerical representations for the texts, so we are going to convert the clean_text column to a matrix of token counts.

In [6]:
# importing libraries
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

In [7]:
# creating a vectorizer object using 1-grams and 2-grams
vectorizer = CountVectorizer(ngram_range=(1, 2))

# encoding the corpus
# extracting token counts out of raw text documents using the vocabulary
token_count_matrix = vectorizer.fit_transform(twitter_data['clean_text'])

# summarizing the numerical features from texts
print(f'The size of the feature matrix for the texts = {token_count_matrix.get_shape()}')
print(f'The first row of the feature matrix = {token_count_matrix[0, ]}.')
print(f'There are {token_count_matrix[0, ].count_nonzero()}/{token_count_matrix.get_shape()[1]} non-zeros')

The size of the feature matrix for the texts = (162969, 1199719)
The first row of the feature matrix =   (0, 1145433)	1
  (0, 666550)	1
  (0, 831874)	1
  (0, 658436)	1
  (0, 435496)	1
  (0, 644085)	1
  (0, 435144)	1
  (0, 357405)	1
  (0, 480827)	1
  (0, 134189)	1
  (0, 1029267)	2
  (0, 299531)	1
  (0, 554527)	1
  (0, 867035)	1
  (0, 976961)	2
  (0, 1155015)	1
  (0, 308537)	1
  (0, 1006645)	1
  (0, 1183127)	1
  (0, 419831)	1
  (0, 562991)	1
  (0, 940181)	2
  (0, 66073)	3
  (0, 728511)	1
  (0, 175799)	1
  :	:
  (0, 357481)	1
  (0, 481025)	1
  (0, 134240)	1
  (0, 1032232)	1
  (0, 299593)	1
  (0, 555042)	1
  (0, 867040)	1
  (0, 1038732)	1
  (0, 977637)	1
  (0, 1155436)	1
  (0, 309052)	1
  (0, 1007443)	1
  (0, 1183655)	1
  (0, 420540)	1
  (0, 563130)	1
  (0, 977513)	1
  (0, 940255)	1
  (0, 72487)	1
  (0, 729190)	1
  (0, 175815)	1
  (0, 74556)	1
  (0, 940627)	1
  (0, 356855)	1
  (0, 838910)	1
  (0, 75386)	1.
There are 60/1199719 non-zeros


When we account for both 1-grams and 2-grams, we can see that we have approximately 1.2 million features (1,199,719).
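
To see why adding bigrams inflates the feature count so much, it may help to compare the two settings on a tiny corpus. This is a minimal sketch; `toy_corpus` is a hypothetical two-document example, not data from the tweets.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus for illustration
toy_corpus = ["modi wins vote", "vote for modi"]

# unigrams only versus unigrams plus bigrams
unigram_vec = CountVectorizer(ngram_range=(1, 1))
bigram_vec = CountVectorizer(ngram_range=(1, 2))

unigram_counts = unigram_vec.fit_transform(toy_corpus)
bigram_counts = bigram_vec.fit_transform(toy_corpus)

# vocabulary_ maps each n-gram to its column index in the count matrix
print(sorted(unigram_vec.vocabulary_))
print(sorted(bigram_vec.vocabulary_))
print(unigram_counts.shape, bigram_counts.shape)
```

Every distinct pair of adjacent words becomes its own column, so the vocabulary grows much faster than the number of distinct words. The column indices printed in the sparse output above, such as `(0, 1145433)`, are positions in this vocabulary.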

## 3. Perform the tf-idf analysis on the clean_text column using CountVectorizer and TfidfTransformer.

We will now use tf-idf analysis, with CountVectorizer and TfidfTransformer, to measure how important each word is to a document relative to the rest of the corpus.
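
As a rough illustration of what the transformer computes, the sketch below rebuilds the weights by hand on a hypothetical three-document corpus and compares them with scikit-learn's result. It assumes the default settings, where the smoothed idf is `ln((1 + n) / (1 + df(t))) + 1` and each row is then L2-normalized. The names `toy_corpus` and `manual_tfidf` are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical three-document corpus for illustration
toy_corpus = ["modi wins vote", "vote for modi", "vote vote today"]

# raw term counts (tf), one row per document
counts = CountVectorizer().fit_transform(toy_corpus)
tf = counts.toarray().astype(float)

# document frequency: number of documents containing each term
n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)

# smoothed idf, as used when smooth_idf=True
idf = np.log((1 + n_docs) / (1 + df)) + 1

# tf * idf, followed by L2 normalization of every row
raw = tf * idf
manual_tfidf = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# comparing against TfidfTransformer with its default parameters
sklearn_tfidf = TfidfTransformer().fit_transform(counts).toarray()
matches = np.allclose(manual_tfidf, sklearn_tfidf)
print(matches)
```

A term that appears in every document, such as "vote" here, gets the smallest idf, so it contributes less per occurrence than a rarer term.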

In [8]:
# importing library
from sklearn.feature_extraction.text import TfidfTransformer

In [9]:
# creating a vectorizer object using the default parameters
vectorizer = CountVectorizer()

# extracting token counts out of raw text documents using the vocabulary
token_count_matrix = vectorizer.fit_transform(twitter_data['clean_text'])

# summarizing token count matrix
print(f'The size of the count matrix for the texts = {token_count_matrix.get_shape()}')
print(f'The sparse count matrix is as follows:')
print(token_count_matrix)

# creating a tf_idf object using the default parameters
tf_idf_transformer = TfidfTransformer(use_idf=True, smooth_idf=True, sublinear_tf=False)

# fitting to the token_count_matrix, then transforming it to a normalized tf-idf representation
tf_idf_matrix_1 = tf_idf_transformer.fit_transform(token_count_matrix)

# summarizing the tf_idf_matrix
print(f'The size of the tf_idf matrix for the texts = {tf_idf_matrix_1.get_shape()}')
print(f'The sparse tf_idf matrix is as follows:')
print(tf_idf_matrix_1)

The size of the count matrix for the texts = (162969, 106924)
The sparse count matrix is as follows:
  (0, 103779)	1
  (0, 62480)	1
  (0, 76936)	1
  (0, 61636)	1
  (0, 40526)	1
  (0, 60316)	1
  (0, 40498)	1
  (0, 34701)	1
  (0, 43979)	1
  (0, 13684)	1
  (0, 95481)	2
  (0, 29341)	1
  (0, 51356)	1
  (0, 80437)	1
  (0, 91103)	2
  (0, 103993)	1
  (0, 30477)	1
  (0, 93827)	1
  (0, 105520)	1
  (0, 39395)	1
  (0, 51984)	1
  (0, 87791)	2
  (0, 8389)	3
  (0, 67997)	1
  (0, 17907)	1
  :	:
  (162968, 65872)	1
  (162968, 56602)	1
  (162968, 95374)	1
  (162968, 63946)	1
  (162968, 5841)	1
  (162968, 5191)	1
  (162968, 74811)	1
  (162968, 17962)	1
  (162968, 69382)	1
  (162968, 43150)	1
  (162968, 103787)	1
  (162968, 47230)	1
  (162968, 82980)	2
  (162968, 34101)	1
  (162968, 34124)	1
  (162968, 89692)	1
  (162968, 77338)	1
  (162968, 10864)	2
  (162968, 44214)	1
  (162968, 25873)	1
  (162968, 56815)	1
  (162968, 58706)	1
  (162968, 29767)	1
  (162968, 58705)	1
  (162968, 41682)	1
The size of the t

## 4. Perform the tf-idf analysis on the clean_text column using TfidfVectorizer.

We will do the same thing in a single step, using only TfidfVectorizer.

In [10]:
#importing library
from sklearn.feature_extraction.text import TfidfVectorizer

In [11]:
# creating a TfidfVectorizer Object using the default parameters
tfidf_vectorizer = TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=False)

# fitting to the corpus, then converting a collection of raw documents to a matrix of TF-IDF features
tf_idf_matrix_2 = tfidf_vectorizer.fit_transform(twitter_data['clean_text'])

# summarizing the tf_idf_matrix
print(f'The size of the tf_idf matrix for the texts = {tf_idf_matrix_2.get_shape()}')
print(f'The sparse tf_idf matrix is as follows:')
print(tf_idf_matrix_2)

The size of the tf_idf matrix for the texts = (162969, 106924)
The sparse tf_idf matrix is as follows:
  (0, 94773)	0.23660485539606377
  (0, 77542)	0.26444447540976607
  (0, 34636)	0.2517654038938212
  (0, 17907)	0.18097710894277283
  (0, 67997)	0.0814332613481619
  (0, 8389)	0.18586937299338827
  (0, 87791)	0.23874126253192132
  (0, 51984)	0.20306582436747234
  (0, 39395)	0.12597723786710188
  (0, 105520)	0.12028708322181941
  (0, 93827)	0.1339988468628054
  (0, 30477)	0.14523926764170483
  (0, 103993)	0.10858917152548613
  (0, 91103)	0.31050659121004387
  (0, 80437)	0.31477404515232055
  (0, 51356)	0.1549548400861804
  (0, 29341)	0.2037990596290857
  (0, 95481)	0.11033764463078546
  (0, 13684)	0.22828354889246916
  (0, 43979)	0.11613310705621377
  (0, 34701)	0.201961493311543
  (0, 40498)	0.19217208728463128
  (0, 60316)	0.21686929453030435
  (0, 40526)	0.12555123559382844
  (0, 61636)	0.18891010271399822
  :	:
  (162968, 10864)	0.33644207883672295
  (162968, 77338)	0.15060798979573
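
The matrix shape here matches the one from step 3, which is expected: with default parameters, TfidfVectorizer should behave like CountVectorizer followed by TfidfTransformer. The sketch below checks that equivalence on a hypothetical toy corpus rather than on the full dataset. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical three-document corpus for illustration
toy_corpus = ["modi wins vote", "vote for modi", "vote vote today"]

# two-step route: token counts first, then tf-idf weighting
two_step = TfidfTransformer().fit_transform(
    CountVectorizer().fit_transform(toy_corpus)
)

# one-step route: TfidfVectorizer does both at once
one_step = TfidfVectorizer().fit_transform(toy_corpus)

same_shape = two_step.shape == one_step.shape
same_values = np.allclose(two_step.toarray(), one_step.toarray())
print(same_shape, same_values)
```

On the real data, the same comparison could be made between `tf_idf_matrix_1` and `tf_idf_matrix_2`, although converting matrices of that size to dense arrays would not be practical.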

## 5. Perform the tf-idf analysis on the clean_text column using HashingVectorizer and TfidfTransformer.

Once again, we will perform tf-idf analysis, this time using HashingVectorizer and TfidfTransformer.

In [12]:
#importing library
from sklearn.feature_extraction.text import HashingVectorizer

In [13]:
#creating a HashingVectorizer object using the default parameters
hash_vectorizer = HashingVectorizer()

# converting a collection of text documents to a matrix of hashed token occurrences
# (with the default parameters, the values are L2-normalized and signed rather than raw counts)
token_count_matrix = hash_vectorizer.fit_transform(twitter_data['clean_text'])

# summarizing the count matrix
print(f'The size of the count matrix for the texts = {token_count_matrix.get_shape()}')
print(f'The sparse count matrix is as follows:')
print(token_count_matrix)

# reusing the transformer created in step 3, since it already uses the default parameters
# (fit_transform refits it on the new matrix)
# fitting to the count matrix, then transforming it to a normalized tf-idf representation
tf_idf_matrix_3 = tf_idf_transformer.fit_transform(token_count_matrix)

# summarizing the tf_idf_matrix
print(f'The size of the tf_idf matrix for the texts = {tf_idf_matrix_3.get_shape()}')
print(f'The sparse tf_idf matrix is as follows:')
print(tf_idf_matrix_3)

The size of the count matrix for the texts = (162969, 1048576)
The sparse count matrix is as follows:
  (0, 160541)	0.14907119849998599
  (0, 168557)	0.14907119849998599
  (0, 180525)	-0.4472135954999579
  (0, 232512)	0.14907119849998599
  (0, 263274)	0.14907119849998599
  (0, 277794)	-0.14907119849998599
  (0, 286878)	-0.29814239699997197
  (0, 288398)	0.14907119849998599
  (0, 360502)	0.29814239699997197
  (0, 387101)	-0.14907119849998599
  (0, 433698)	0.14907119849998599
  (0, 434864)	0.14907119849998599
  (0, 449993)	-0.14907119849998599
  (0, 465141)	-0.14907119849998599
  (0, 482215)	-0.14907119849998599
  (0, 484920)	-0.14907119849998599
  (0, 490370)	0.29814239699997197
  (0, 522187)	0.14907119849998599
  (0, 614924)	0.14907119849998599
  (0, 646934)	0.14907119849998599
  (0, 747378)	-0.14907119849998599
  (0, 748718)	0.14907119849998599
  (0, 808196)	-0.14907119849998599
  (0, 839641)	-0.14907119849998599
  (0, 865698)	0.14907119849998599
  :	:
  (162968, 257965)	0.16222142113
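
The fractional and negative values in the "count" matrix above come from HashingVectorizer's defaults, which apply L2 normalization (`norm='l2'`) and alternate the sign of hashed features (`alternate_sign=True`). If raw non-negative counts are wanted as input to TfidfTransformer, one possible variation is sketched below. It is an alternative to the default-parameter run above, not a replacement for it. The toy corpus and the reduced `n_features=2**10` are assumptions made to keep the example small.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# Hypothetical three-document corpus for illustration
toy_corpus = ["modi wins vote", "vote for modi", "vote vote today"]

# norm=None and alternate_sign=False make the output plain hashed counts;
# n_features is reduced from the default 2**20 only to keep this toy small
count_hasher = HashingVectorizer(n_features=2**10, norm=None, alternate_sign=False)

# HashingVectorizer is stateless, so transform works without fitting
hashed_counts = count_hasher.transform(toy_corpus)

# applying tf-idf weighting to the hashed counts
hashed_tfidf = TfidfTransformer().fit_transform(hashed_counts)

print(hashed_counts.sum(), hashed_counts.min())
print(hashed_tfidf.shape)
```

Because hashing keeps no vocabulary, column indices cannot be mapped back to words, and distinct words may collide in the same column. A larger `n_features` makes collisions less likely.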