# Text classification using TF-IDF

### 1. Load the dataset from sklearn.datasets

In [1]:
from sklearn.datasets import fetch_20newsgroups

In [2]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

### 2. Training data

In [3]:
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

In [18]:
train_features = twenty_train.data

### 3. Test data

In [4]:
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

In [19]:
test_features = twenty_test.data

### a. You can access the values of the target variable using the `.target` attribute

In [22]:
train_target = twenty_train.target
train_target

array([1, 1, 3, ..., 2, 2, 2], dtype=int64)

In [23]:
test_target = twenty_test.target
test_target

array([2, 2, 2, ..., 2, 2, 1], dtype=int64)

### b. You can access the class names of the target variable with `.target_names`

In [6]:
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
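As a small illustration, the integer targets can be mapped back to class names by indexing into `target_names`. The sketch below hard-codes the names and the first three training targets from the outputs shown above; `first_targets` and `first_labels` are names introduced here for illustration only.

```python
# Class names as printed by twenty_train.target_names above
target_names = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

# First three values of train_target, copied from the output above (illustrative variable)
first_targets = [1, 1, 3]

# Each integer is an index into target_names
first_labels = [target_names[i] for i in first_targets]
print(first_labels)
```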

In [7]:
twenty_train.data[0:5]

['From: sd345@city.ac.uk (Michael Collier)\nSubject: Converting images to HP LaserJet III?\nNntp-Posting-Host: hampton\nOrganization: The City University\nLines: 14\n\nDoes anyone know of a good way (standard PC application/PD utility) to\nconvert tif/img/tga files into LaserJet III format.  We would also like to\ndo the same, converting to HPGL (HP plotter) files.\n\nPlease email any response.\n\nIs this the correct group?\n\nThanks in advance.  Michael.\n-- \nMichael Collier (Programmer)                 The Computer Unit,\nEmail: M.P.Collier@uk.ac.city                The City University,\nTel: 071 477-8000 x3769                      London,\nFax: 071 477-8565                            EC1V 0HB.\n',
 "From: ani@ms.uky.edu (Aniruddha B. Deglurkar)\nSubject: help: Splitting a trimming region along a mesh \nOrganization: University Of Kentucky, Dept. of Math Sciences\nLines: 28\n\n\n\n\tHi,\n\n\tI have a problem, I hope some of the 'gurus' can help me solve.\n\n\tBackground of the probl

### 4. With features and targets available for both the train and test datasets, use TfidfVectorizer to fit and transform the training data, then transform the test data, to get the TF-IDF features for both

Hint: Use ".fit_transform" on Train set and ".transform" on Test set

In [84]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [85]:
tf_idf = TfidfVectorizer( max_features=5000)

In [86]:
twenty_train_vec = tf_idf.fit_transform(twenty_train.data)
twenty_train_vec.shape

(2257, 5000)

In [87]:
twenty_test_vec = tf_idf.transform(twenty_test.data)
twenty_test_vec.shape

(1502, 5000)
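The difference between `fit_transform` and `transform` used above can be seen on a toy corpus. This is a minimal sketch with made-up sentences (`toy_train`, `toy_test` and `toy_tfidf` are illustrative names, not part of the notebook): the vocabulary is learned from the training texts only, so words that appear only in the test text are ignored and the test matrix keeps the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora, made up for this illustration
toy_train = ["the cat sat", "the dog barked"]
toy_test = ["the bird sang"]

toy_tfidf = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from the training texts
toy_train_vec = toy_tfidf.fit_transform(toy_train)
# transform reuses that vocabulary; "bird" and "sang" were never seen, so they are dropped
toy_test_vec = toy_tfidf.transform(toy_test)

print(sorted(toy_tfidf.vocabulary_))  # the words seen in training
print(toy_train_vec.shape, toy_test_vec.shape)
print(toy_test_vec.nnz)  # number of non-zero entries in the test row
```

Calling `fit_transform` on the test set instead would learn a different vocabulary, and the resulting columns would no longer line up with the features the model was trained on.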

### 5. Use LogisticRegression with the TF-IDF features as input and the targets as output, train the model, and report the train and test accuracy scores

In [88]:
from sklearn.linear_model import LogisticRegression

In [89]:
LR = LogisticRegression()

In [90]:
LR.fit(twenty_train_vec,train_target)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [91]:
y_pred_train = LR.predict(twenty_train_vec)

In [92]:
# make class predictions 
y_pred_test = LR.predict(twenty_test_vec)

In [93]:
y_pred_test

array([2, 2, 2, ..., 2, 2, 1], dtype=int64)

In [94]:
from sklearn.metrics import accuracy_score

In [95]:
print("The train accuracy is",accuracy_score(train_target,y_pred_train ))

The train accuracy is 0.9902525476295968


In [96]:
print("The test accuracy is",accuracy_score(test_target,y_pred_test ))

The test accuracy is 0.9007989347536618
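Accuracy is simply the fraction of predictions that match the true labels. The sketch below spells that out in plain Python; `accuracy` is a hypothetical helper written for this illustration and the label lists are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up labels: three of the four predictions are correct
toy_accuracy = accuracy([2, 2, 1, 0], [2, 1, 1, 0])
print(toy_accuracy)
```

Comparing the train and test scores side by side, as done above, is a quick check of how well the model carries over to unseen documents.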


## Sentiment analysis

The objective of this problem is to perform sentiment analysis on tweets that users posted about various mobile devices.
Based on the text of a tweet, we will classify whether the sentiment the user directs at a particular mobile device is positive or negative.

### 6. Read the dataset (tweets.csv) and drop rows with missing values while reading it

Hint: pd.read_csv('./tweets.csv',encoding = "ISO-8859-1").dropna()

In [98]:
import pandas as pd
data = pd.read_csv("tweets.csv",encoding = "ISO-8859-1").dropna()

In [99]:
data.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [100]:
data.isna().sum().sum()

0

### 7. Preprocess the text and add the preprocessed text in a column with name `text` in the dataframe.

In [101]:
import string, re

def preprocess(text):
    try:
        # Drop every punctuation character (this also strips '@', '#', '.', ':' and '/')
        nopunc = ''.join(char for char in text if char not in string.punctuation)
        # convert text to lower-case
        nopunc = nopunc.lower()
        # remove URLs; with punctuation already gone, only the r'http\S+' pattern
        # below can still match (e.g. a token that starts with 'http')
        nopunc = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+)|(http?://[^\s]+))', '', nopunc)
        nopunc = re.sub(r'http\S+', '', nopunc)
        # remove usernames (a no-op here: '@' was stripped above, so names such as 'wesley83' remain)
        nopunc = re.sub(r'@[^\s]+', '', nopunc)
        # remove the # in #hashtag (also a no-op after punctuation removal)
        nopunc = re.sub(r'#([^\s]+)', r'\1', nopunc)
        return nopunc
    except Exception:
        # non-string input is mapped to an empty string
        return ""
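The order of the steps matters: once punctuation is removed, patterns that rely on `@`, `#` or `://` can no longer match. The following is a sketch of an alternative ordering, not the function used in this notebook; `preprocess_urls_first` is a hypothetical name, and switching to it would change the outputs shown in the following cells.

```python
import re
import string

def preprocess_urls_first(text):
    """Strip URLs, @usernames and '#' before removing punctuation (illustrative variant)."""
    text = text.lower()
    # remove URLs while '://' and '.' are still present
    text = re.sub(r'(www\.\S+)|(https?://\S+)', '', text)
    # remove @usernames while '@' is still present
    text = re.sub(r'@\S+', '', text)
    # keep the hashtag word but drop the leading '#'
    text = re.sub(r'#(\S+)', r'\1', text)
    # now remove the remaining punctuation
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    # collapse repeated whitespace left behind by the removals
    return ' '.join(text.split())

# Made-up tweet for illustration
cleaned = preprocess_urls_first("@bob Check http://t.co/x #SXSW rocks!")
print(cleaned)
```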

In [102]:
data['text'] = [preprocess(text) for text in data.tweet_text]

In [105]:
data[['text']]

| | text |
|---|---|
| 0 | wesley83 i have a 3g iphone after 3 hrs tweeti... |
| 1 | jessedee know about fludapp awesome ipadiphon... |
| 2 | swonderlin can not wait for ipad 2 also they s... |
| 3 | sxsw i hope this years festival isnt as crashy... |
| 4 | sxtxstate great stuff on fri sxsw marissa maye... |
| ... | ... |
| 9077 | mention your pr guy just convinced me to switc... |
| 9079 | quotpapyrussort of like the ipadquot nice lol... |
| 9080 | diller says google tv quotmight be run over by... |
| 9085 | ive always used camera for my iphone bc it has... |


### 8. Consider only rows having a Positive or Negative emotion and remove other rows from the dataframe.

Hint: Use df = df[(df["col_name"] == "Positive emotion") | (df["col_name"] == "Negative emotion")]

In [107]:
data = data[(data["is_there_an_emotion_directed_at_a_brand_or_product"] == "Positive emotion") |
            (data["is_there_an_emotion_directed_at_a_brand_or_product"] == "Negative emotion")]

In [108]:
data['is_there_an_emotion_directed_at_a_brand_or_product'].unique()

array(['Negative emotion', 'Positive emotion'], dtype=object)
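An equivalent way to write this kind of filter is `Series.isin`, which scales better when more than two values are kept. The sketch below uses a made-up toy frame; the column name `emotion` and the rows are assumptions for illustration, not the notebook's data.

```python
import pandas as pd

# Toy frame; the column name "emotion" and the rows are made up for this sketch
toy = pd.DataFrame({
    "tweet_text": ["love my ipad", "battery died again", "at the keynote"],
    "emotion": ["Positive emotion", "Negative emotion", "Something else"],
})

# Keep only rows whose label is one of the two wanted values
wanted = ["Positive emotion", "Negative emotion"]
toy_filtered = toy[toy["emotion"].isin(wanted)]
print(toy_filtered)
```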

### 9. Represent text as numerical data using `CountVectorizer` and get the document term frequency matrix

Hint: Perform fit (".fit") and transformation (".transform") on the whole data; later we will use CountVectorizer "fit_transform" and "transform" on train and test separately

In [123]:
from sklearn.feature_extraction.text import CountVectorizer

In [124]:
vect = CountVectorizer()

In [133]:
vect.fit_transform(data['text'])

<3191x6110 sparse matrix of type '<class 'numpy.int64'>'
	with 52504 stored elements in Compressed Sparse Row format>
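To make the sparse output above more concrete, here is a minimal sketch of a document-term matrix on three made-up sentences (`toy_docs`, `toy_vect` and `toy_dtm` are illustrative names). Each row is a document, each column is a vocabulary word in alphabetical order, and each cell is how often that word occurs in that document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration
toy_docs = ["ipad is great", "iphone battery is dead", "great great app"]

toy_vect = CountVectorizer()
toy_dtm = toy_vect.fit_transform(toy_docs)

# Columns follow the alphabetical order of the learned vocabulary
toy_columns = sorted(toy_vect.vocabulary_)
print(toy_columns)
# Dense view is fine for a toy example; real matrices are kept sparse
print(toy_dtm.toarray())
```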

### 10. Find number of different words in vocabulary

#### Tip: To see all available attributes and methods of an object, use dir, then use the appropriate one to find the number of different words in the vocabulary

In [134]:
dir(vect)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_char_ngrams',
 '_char_wb_ngrams',
 '_check_stop_words_consistency',
 '_check_vocabulary',
 '_count_vocab',
 '_get_param_names',
 '_get_tags',
 '_limit_features',
 '_more_tags',
 '_sort_features',
 '_stop_words_id',
 '_validate_custom_analyzer',
 '_validate_params',
 '_validate_vocabulary',
 '_warn_for_unused_params',
 '_white_spaces',
 '_word_ngrams',
 'analyzer',
 'binary',
 'build_analyzer',
 'build_preprocessor',
 'build_tokenizer',
 'decode',
 'decode_error',
 'dtype',
 'encoding',
 'fit',
 'fit_transform',
 'fixed_vocabulary_',
 'get_feature_names',
 'get_params',
 'get_stop_words'

In [135]:
vect.vocabulary_

{'wesley83': 5847,
 'have': 2415,
 '3g': 101,
 'iphone': 2795,
 'after': 258,
 'hrs': 2570,
 'tweeting': 5561,
 'at': 470,
 'riseaustin': 4493,
 'it': 2828,
 'was': 5801,
 'dead': 1386,
 'need': 3497,
 'to': 5425,
 'upgrade': 5651,
 'plugin': 3934,
 'stations': 4984,
 'sxsw': 5151,
 'jessedee': 2859,
 'know': 2950,
 'about': 183,
 'fludapp': 2012,
 'awesome': 534,
 'ipadiphone': 2782,
 'app': 388,
 'that': 5315,
 'youll': 6040,
 'likely': 3073,
 'appreciate': 418,
 'for': 2041,
 'its': 2831,
 'design': 1448,
 'also': 314,
 'theyre': 5346,
 'giving': 2215,
 'free': 2074,
 'ts': 5529,
 'swonderlin': 5142,
 'can': 878,
 'not': 3567,
 'wait': 5771,
 'ipad': 2775,
 'they': 5343,
 'should': 4739,
 'sale': 4551,
 'them': 5327,
 'down': 1604,
 'hope': 2542,
 'this': 5363,
 'years': 6020,
 'festival': 1945,
 'isnt': 2822,
 'as': 459,
 'crashy': 1274,
 'sxtxstate': 5190,
 'great': 2298,
 'stuff': 5060,
 'on': 3651,
 'fri': 2080,
 'marissa': 3254,
 'mayer': 3283,
 'google': 2255,
 'tim': 5397,
 '
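One way to get the number of different words is to take the length of the fitted `vocabulary_` dictionary, which matches the number of columns of the document-term matrix. A minimal sketch on made-up documents (all names below are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration
toy_docs = ["free ipad app", "ipad launch", "free launch party"]

toy_vect = CountVectorizer()
toy_dtm = toy_vect.fit_transform(toy_docs)

# vocabulary_ maps each word to its column index, so its length is the vocabulary size
vocab_size = len(toy_vect.vocabulary_)
print(vocab_size, toy_dtm.shape[1])
```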

### 11. Find out how many Positive and Negative emotions are there.

Hint: Use value_counts on that column

In [136]:
data['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

Positive emotion    2672
Negative emotion     519
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
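Because the two classes are far from balanced, it helps to know what accuracy a trivial model would reach by always predicting the majority class. The arithmetic below uses the counts printed above; note that this baseline is computed on the full filtered data, not on any particular train/test split, and `majority_baseline` is a name introduced here.

```python
# Class counts copied from the value_counts() output above
positive, negative = 2672, 519

total = positive + negative
# Accuracy of a classifier that always predicts the majority class
majority_baseline = max(positive, negative) / total
print(total, round(majority_baseline, 4))
```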

### 12. Change the labels for Positive and Negative emotions to 1 and 0 respectively and store them in a new column named `label` in the same dataframe

Hint: use map on that column and give labels

In [158]:
data["label"] = data['is_there_an_emotion_directed_at_a_brand_or_product'].map(lambda x: 0 if x == 'Negative emotion' else 1)
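The lambda above sends every value that is not 'Negative emotion' to 1. An alternative, sketched here on a made-up Series, is to pass an explicit dictionary to `map`, so that any unexpected value would show up as NaN instead of silently becoming 1. The names `emotions` and `label_map` are illustrative.

```python
import pandas as pd

# Toy labels, made up for this sketch
emotions = pd.Series(["Positive emotion", "Negative emotion", "Positive emotion"])

# An explicit dictionary maps each known value; anything else would become NaN
label_map = {"Negative emotion": 0, "Positive emotion": 1}
labels = emotions.map(label_map)
print(labels.tolist())
```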

### 13. Define the feature set (independent variable or X) to be the `text` column and `label` as the target (dependent variable), divide into train and test datasets, and display the shapes

In [159]:
from sklearn.model_selection import train_test_split
x = data['text']
y = data['label']

In [160]:
y.value_counts()

1    2672
0     519
Name: label, dtype: int64

In [161]:
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=2)

In [162]:
x_train.shape

(2393,)

In [163]:
x_test.shape

(798,)
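The split above uses the default test size of 25%. With imbalanced classes, one option (not used in this notebook) is to pass `stratify` so that both splits keep the same class ratio. A minimal sketch on made-up data, with all variable names chosen for illustration:

```python
from sklearn.model_selection import train_test_split

# Toy data: 16 positive and 4 negative examples (made up)
toy_x = [f"tweet {i}" for i in range(20)]
toy_y = [1] * 16 + [0] * 4

# stratify=toy_y keeps the 80/20 class ratio in both splits;
# test_size=0.25 is also the default when no size is given
tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, test_size=0.25, stratify=toy_y, random_state=2
)
print(len(tx_train), len(tx_test))
print(sum(ty_train), sum(ty_test))  # positives in each split
```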

### 14. Predicting the sentiment

Use (i) Naive Bayes and (ii) Logistic Regression and print their accuracy scores for predicting the sentiment of the given text

In [164]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [165]:
vect = CountVectorizer()

# create document-term matrices
x_train_dtm = vect.fit_transform(x_train)
x_test_dtm = vect.transform(x_test)

In [166]:
NB=MultinomialNB()

In [167]:
NB.fit(x_train_dtm,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [168]:
y_pred = NB.predict(x_test_dtm)

In [169]:
NB.score(x_train_dtm,y_train)

0.9494358545758462

In [170]:
metrics.accuracy_score(y_test,y_pred)

0.8646616541353384
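The cells above cover the Naive Bayes half of the task. Logistic Regression follows the same fit and predict pattern on the document-term matrix. The sketch below shows that pattern on made-up toy texts rather than the tweet data, so it says nothing about the accuracy Logistic Regression would reach here; all names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training and test texts with made-up labels (1 = positive, 0 = negative)
toy_train = ["great awesome love", "love great app",
             "terrible battery dead", "hate crashy dead"]
toy_train_y = [1, 1, 0, 0]
toy_test = ["great love", "dead battery"]

toy_vect = CountVectorizer()
toy_train_dtm = toy_vect.fit_transform(toy_train)
toy_test_dtm = toy_vect.transform(toy_test)

# Same fit / predict pattern as the Naive Bayes cells above
toy_lr = LogisticRegression()
toy_lr.fit(toy_train_dtm, toy_train_y)
toy_pred = toy_lr.predict(toy_test_dtm)
print(toy_pred.tolist())
```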

### 15. Create a function called `tokenize_predict` that takes a count vectorizer object as input, creates document-term matrices from x_train and x_test, builds and trains a model on them, and prints the accuracy

In [171]:
vect = CountVectorizer()
def tokenize_predict(vect):
    x_train_dtm = vect.fit_transform(x_train)
    print('Features: ', x_train_dtm.shape[1])
    x_test_dtm = vect.transform(x_test)
    nb = MultinomialNB()
    nb.fit(x_train_dtm, y_train)
    y_pred_class = nb.predict(x_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))
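`tokenize_predict` reads `x_train`, `x_test`, `y_train` and `y_test` from the notebook's global scope and only prints its results. A possible variant, sketched here under the hypothetical name `fit_and_score`, takes the data as arguments and returns the numbers, which makes it easier to compare several vectorizers. The toy data is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

def fit_and_score(vect, x_train, y_train, x_test, y_test):
    """Vectorize, train Naive Bayes, and return (n_features, test accuracy)."""
    train_dtm = vect.fit_transform(x_train)
    test_dtm = vect.transform(x_test)
    nb = MultinomialNB()
    nb.fit(train_dtm, y_train)
    return train_dtm.shape[1], accuracy_score(y_test, nb.predict(test_dtm))

# Toy data, made up for this sketch (1 = positive, 0 = negative)
toy_train = ["great awesome love", "love great app",
             "terrible battery dead", "hate crashy dead"]
toy_train_y = [1, 1, 0, 0]
toy_test = ["great love", "dead battery"]
toy_test_y = [1, 0]

n_features, acc = fit_and_score(CountVectorizer(), toy_train, toy_train_y,
                                toy_test, toy_test_y)
print(n_features, acc)
```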

### 16. Create a count vectorizer with ngram_range = (1, 2) and pass it to the tokenize_predict function to print the accuracy score

Hint: vect = CountVectorizer(ngram_range=(1, 2))

In [172]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
tokenize_predict(vect)

Features:  25184
Accuracy:  0.8796992481203008
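The jump in feature count comes from the added 2-grams. A minimal sketch on one made-up sentence shows which features `ngram_range=(1, 2)` produces (`toy_vect` and `toy_features` are illustrative names):

```python
from sklearn.feature_extraction.text import CountVectorizer

# One made-up sentence for illustration
toy_vect = CountVectorizer(ngram_range=(1, 2))
toy_vect.fit(["not good at all"])

# Four single words plus three pairs of adjacent words
toy_features = sorted(toy_vect.vocabulary_)
print(toy_features)
```

Pairs such as "not good" carry information that the single words "not" and "good" lose when counted separately.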


### 17. Create a count vectorizer with stop_words = 'english' and pass it to the tokenize_predict function to print the accuracy score

In [173]:
# remove English stop words
vect = CountVectorizer(stop_words='english')
tokenize_predict(vect)

Features:  5023
Accuracy:  0.8709273182957393
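To see what the built-in English stop word list removes, the sketch below fits two vectorizers on one made-up sentence. Note that the list includes "not", which may matter for sentiment tasks. The variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_doc = ["this is not a good phone"]  # made-up sentence

with_stop = CountVectorizer()
with_stop.fit(toy_doc)
without_stop = CountVectorizer(stop_words='english')
without_stop.fit(toy_doc)

kept_all = sorted(with_stop.vocabulary_)
kept_filtered = sorted(without_stop.vocabulary_)
print(kept_all)       # "a" is already dropped by the default one-character token filter
print(kept_filtered)  # stop words such as "this", "is" and "not" are gone
```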


### 18. Create a count vectorizer with stop_words = 'english' and max_features = 300 and pass it to the tokenize_predict function to print the accuracy score

In [175]:
# remove English stop words and only keep 300 features
vect = CountVectorizer(stop_words = 'english',max_features =300)
tokenize_predict(vect)

Features:  300
Accuracy:  0.8358395989974937
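`max_features` keeps only the terms with the highest total count across the corpus. A minimal sketch on made-up documents, with illustrative names:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents: "ipad" occurs 4 times, "launch" 2, "party" and "iphone" once each
toy_docs = ["ipad ipad ipad launch", "ipad launch party", "iphone"]

toy_vect = CountVectorizer(max_features=2)
toy_vect.fit(toy_docs)

# Only the two most frequent terms survive
top_terms = sorted(toy_vect.vocabulary_)
print(top_terms)
```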


### 19. Create a count vectorizer with ngram_range = (1, 2) and max_features = 15000 and pass it to the tokenize_predict function to print the accuracy score

In [176]:
# include 1-grams and 2-grams, and limit the number of features to 15000
vect = CountVectorizer(ngram_range=(1, 2),max_features =15000)
tokenize_predict(vect)

Features:  15000
Accuracy:  0.8759398496240601


### 20. Create a count vectorizer with ngram_range = (1, 2) that only includes terms appearing in at least 2 documents (min_df = 2) and pass it to the tokenize_predict function to print the accuracy score

In [177]:
# include 1-grams and 2-grams, and only include terms that appear in at least 2 documents
vect = CountVectorizer(ngram_range=(1, 2),min_df =2)
tokenize_predict(vect)

Features:  7867
Accuracy:  0.8784461152882206
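`min_df` counts the number of documents a term appears in, not its total number of occurrences. The sketch below, on made-up documents with illustrative names, shows a word that occurs three times in a single document being dropped while a word that appears once in each of two documents is kept.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents: "crash" occurs 3 times but in one document only;
# "great" occurs once in each of two documents
toy_docs = ["crash crash crash", "great app", "great launch"]

toy_vect = CountVectorizer(min_df=2)
toy_vect.fit(toy_docs)

frequent_terms = sorted(toy_vect.vocabulary_)
print(frequent_terms)
```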
