# Exercise with Natural Language Processing

For today's exercise we will do two things. The first is to rebuild the model from the lecture with the same data; the second is to build a new model with new data.

## PART 1: 
- 20 Newsgroups Corpus


## PART 2:
- Republican vs Democrat Tweet Classifier

In [83]:
# Import pandas for data handling
import pandas as pd

# NLTK is the Natural Language Toolkit
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Libraries for helping us with strings
import string
# Regular Expression Library
import re

# Import our text vectorizers
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer


# Import our classifiers
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier


# Import some ML helper functions
from sklearn.model_selection import train_test_split
from sklearn.metrics import plot_confusion_matrix


# Import our metrics to evaluate our model
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, f1_score, roc_auc_score 

# Helper function for hyperparameter tuning
from sklearn.model_selection import GridSearchCV

# Library for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# You may need to download these from nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
stopwords = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\zeeha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\zeeha\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\zeeha\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Load and display data.
1. Load the 20-newsgroups.csv data into a dataframe.
1. Print the shape
1. Inspect / remove nulls and duplicates
1. Find class balances, print out how many of each topic_category there are.

In [None]:
# 1. Load the 20-newsgroups.csv data into a dataframe.
# 2. Print the shape
df = pd.read_csv('data/20-newsgroups.csv')
df.shape


In [None]:
# 3. Inspect / remove nulls and duplicates
df.isnull().sum()
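
The cell above only counts nulls. One possible way to finish step 3 is sketched below on a tiny made-up frame. The column names `message` and `topic_category` follow the ones used later in this notebook, and the toy rows are invented for illustration, so treat this as a sketch rather than the reference solution.

```python
import pandas as pd

# Toy stand-in for the newsgroups frame; the rows are invented for illustration.
toy_df = pd.DataFrame({
    "message": ["hello world", "hello world", None, "guns and politics"],
    "topic_category": ["misc", "misc", "misc", "talk.politics.guns"],
})

print(toy_df.isnull().sum())      # nulls per column
print(toy_df.duplicated().sum())  # number of fully duplicated rows

# Drop rows with a missing message, then drop exact duplicate rows.
toy_clean = (
    toy_df.dropna(subset=["message"])
          .drop_duplicates()
          .reset_index(drop=True)
)
print(toy_clean.shape)
```

Resetting the index afterwards is optional, but it keeps positional lookups such as `df['message'][0]` predictable after rows have been removed.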

In [None]:
# 4. Find class balances, print out how many of each topic_category there are.
df['topic_category'].value_counts(normalize = True)

# Text Pre-Processing 
(aka Feature engineering)
1. Make a function that makes all text lowercase.
    * Do a sanity check by feeding in a test sentence into the function. 
    
    
2. Make a function that removes all punctuation. 
    * Do a sanity check by feeding in a test sentence into the function. 
    
    
3. Make a function that removes all stopwords.
    * Do a sanity check by feeding in a test sentence into the function. 
    
    
4. EXTRA CREDIT (This step only): Make a function that stems all words. 


5. Mandatory: Make a pipeline function that applies all the text processing functions you just built.
    * Do a sanity check by feeding in a test sentence into the pipeline. 
    
    
    
6. Mandatory: Use `df['clean_message'] = df[column].apply(???)` and apply the text pipeline to your text data column. 

In [13]:
# 1. Make a function that makes all text lowercase.
def make_lower(a_string):
    return a_string.lower()
test_string = 'This is A SENTENCE with LOTS OF CAPS.'
make_lower(test_string)

'this is a sentence with lots of caps.'

In [14]:
# 2. Make a function that removes all punctuation. 
# Remove all punctuation
def remove_punctuation(a_string):    
    a_string = re.sub(r'[^\w\s]','',a_string)
    return a_string


a_sentence = 'This is a sentence! 50 With lots of punctuation??? & other #things.'
remove_punctuation(a_sentence)


'This is a sentence 50 With lots of punctuation  other things'

In [15]:
# 3. Make a function that removes all stopwords.
# Remove all stopwords

def remove_stopwords(a_string):
    # Break the sentence down into a list of words
    words = word_tokenize(a_string)
    
    # Make a list to append valid words into
    valid_words = []
    
    # Loop through all the words
    for word in words:
        
        # Check if word is not in stopwords
        if word not in stopwords:
            
            # If word not in stopwords, append to our valid_words
            valid_words.append(word)

    # Join the list of words together into a string
    a_string = ' '.join(valid_words)

    return a_string
            
a_sentence = 'This is a sentence! With some different stopwords i have added in here.'
remove_stopwords(a_sentence)


'This sentence ! With different stopwords added .'
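
Note that `This` survives in the output above because the stopword list is lowercase and the comparison is case-sensitive, which is why the pipeline lowercases before removing stopwords. A minimal sketch of that effect follows, using a hypothetical mini stopword set and plain whitespace splitting so it runs without NLTK data:

```python
# Hypothetical mini stopword set, used only to keep this sketch self-contained.
toy_stopwords = {"this", "is", "a", "with", "some"}

def drop_stopwords(text, stop_set):
    # Whitespace tokenization keeps the example dependency-free.
    return " ".join(word for word in text.split() if word not in stop_set)

sentence = "This is a sentence with some stopwords"

# Case-sensitive comparison: "This" is kept because only "this" is in the set.
print(drop_stopwords(sentence, toy_stopwords))          # This sentence stopwords

# Lower-casing first removes it.
print(drop_stopwords(sentence.lower(), toy_stopwords))  # sentence stopwords
```

Using a `set` instead of a list also makes each membership check constant-time, which can matter when the function is applied to every row of a large dataframe.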

In [None]:
# 4. EXTRA CREDIT: Make a function that stems all words. 

test_string = 'I played and started playing with players and we all love to play with plays'
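
The extra-credit cell is left empty. One possible solution is sketched below with NLTK's `PorterStemmer`. The function name `stem_words` is an assumption, not part of the original exercise, and the sketch splits on whitespace, so it assumes punctuation has already been removed.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_words(a_string):
    # Split on whitespace (assumes punctuation was already removed),
    # stem each word, and join the stems back into one string.
    return ' '.join(stemmer.stem(word) for word in a_string.split())

test_string = 'I played and started playing with players and we all love to play with plays'
print(stem_words(test_string))
```

If you add this to `text_pipeline`, placing it after `remove_stopwords` avoids stemming words that are about to be discarded.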



In [16]:
# 5. MANDATORY: Make a pipeline function that applies all the text processing functions you just built.

def text_pipeline(input_string):
    input_string = make_lower(input_string)
    input_string = remove_punctuation(input_string)
    #input_string = lem_with_pos_tag(input_string)
    input_string = remove_stopwords(input_string)    
    return input_string

test_string = 'I played and started playing with players and we all love to play with plays'

text_pipeline(test_string)

'played started playing players love play plays'

In [None]:
# 6. Mandatory: Use `df[column].apply(???)` and apply the text pipeline to your text data column. 
df['clean_message'] = df['message'].apply(text_pipeline)

# Text Vectorization

1. Define your `X` and `y` data. 


2. Initialize a vectorizer (you can use TFIDF or BOW, it is your choice).
    * Do you want to use n-grams..?


3. Fit your vectorizer using your X training data.
    * Remember, `fit` updates the vectorizer IN PLACE, so you do not need to reassign it.


4. Transform your X data using your fitted vectorizer. 
    * `X = vectorizer.???`



5. Print the shape of your X.  How many features (aka columns) do you have?

In [None]:
# 1. Define your `X` and `y` data. 
X = df['clean_message'].values
y = df['topic_category'].values

In [None]:
# Split our data into testing and training like always. 
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)


# Save the raw text for later just in case
X_train_text = X_train
X_test_text = X_test

In [None]:
# 2. Initialize a vectorizer (you can use TFIDF or BOW, it is your choice).
vectorizer = TfidfVectorizer()
vectorizer.fit(X_train)

X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)

In [None]:
# 5. Print the shape of your X.  How many features (aka columns) do you have?
X_train.shape

# Split your data into Training and Testing data. 

___
# Build and Train Model
Use Multinomial Naive Bayes to classify these documents. 

1. Initialize an empty model. 
2. Fit the model with our training data.


Experiment with different alphas.  Use the alpha that gives you the best result.

EXTRA CREDIT:  Use grid search to programmatically do this for you. 
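
To see what `alpha` actually changes, the sketch below reproduces Multinomial Naive Bayes' additive smoothing by hand on a toy count matrix and compares it with the fitted model. The counts and labels are invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix (3 documents x 4 words) and labels, invented for illustration.
counts = np.array([
    [2, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 3, 2],
])
labels = np.array(["a", "a", "b"])

alpha = 0.5
nb = MultinomialNB(alpha=alpha).fit(counts, labels)

# Additive smoothing by hand:
# (count of word in class + alpha) / (total count in class + alpha * vocabulary size)
class_a_counts = counts[labels == "a"].sum(axis=0)
manual = (class_a_counts + alpha) / (class_a_counts.sum() + alpha * counts.shape[1])

print(manual)
print(np.exp(nb.feature_log_prob_[0]))  # row 0 corresponds to nb.classes_[0], i.e. "a"
```

A larger `alpha` pulls every word probability toward uniform, while a very small `alpha` lets words that never appeared in a class drive its probability close to zero.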

In [None]:
# 1. Initialize an empty model. 
model = MultinomialNB(alpha = 0.05)

In [None]:
# Fit our model with our training data.
model.fit(X_train, y_train)

# Evaluate the model.

1. Make new predictions using our test data. 
2. Print the accuracy of the model. 
3. Print the confusion matrix of our predictions. 
4. Using `classification_report` print the evaluation results for all the classes. 



In [None]:
# 1. Make new predictions of our testing data. 
y_pred = model.predict(X_test)

y_pred_proba = model.predict_proba(X_test)

In [None]:
# 2. Print the accuracy of the model. 
accuracy = model.score(X_test, y_test)

print("Model Accuracy: %f" % accuracy)

In [None]:
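# EXTRA CREDIT: grid search over alpha.
# alpha must be numeric; a string such as 'auto' makes those fits fail.
params = {
    'alpha' : [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
         }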
grid_search_cv = GridSearchCV(
    estimator=MultinomialNB(),
    param_grid = params, 
    scoring = 'accuracy')

grid_search_cv.fit(X_train, y_train)

print(grid_search_cv.best_params_)

model = grid_search_cv.best_estimator_


In [None]:
accuracy = model.score(X_test, y_test)

print("Model Accuracy: %f" % accuracy)

In [None]:
# 3. Plot the confusion matrix of our predictions
fig, ax = plt.subplots(figsize=(21,21))

disp = plot_confusion_matrix(model, X_test, y_test, display_labels=model.classes_,cmap=plt.cm.Blues, ax = ax)

plt.xticks(rotation=90)
disp
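
`plot_confusion_matrix` was deprecated in scikit-learn 1.0 and removed in 1.2, so the cell above only runs on older versions. A minimal sketch of the replacement API, `ConfusionMatrixDisplay`, is shown below on toy labels that are invented for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Toy labels and predictions, invented for illustration.
toy_true = ["guns", "space", "guns", "space", "guns"]
toy_pred = ["guns", "space", "space", "space", "guns"]
toy_labels = ["guns", "space"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(toy_true, toy_pred, labels=toy_labels)
print(cm)

fig, ax = plt.subplots(figsize=(4, 4))
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=toy_labels).plot(
    cmap=plt.cm.Blues, ax=ax, xticks_rotation="vertical")
```

In this notebook, `ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, cmap=plt.cm.Blues, ax=ax)` mirrors the original call most closely.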

In [None]:
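# 4. Using `classification_report` print the evaluation results for all the classes. 
# Re-predict first, since `model` was replaced by the grid search's best estimator.
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=model.classes_))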

# Manual prediction
Write a new sentence that you think will be classified as talk.politics.guns. 
1. Apply the text pipeline to your sentence
2. Transform your cleaned text using the `X = vectorizer.transform([your_text])`
    * Note, the `transform` function accepts a list and not an individual string.
3. Use the model to predict your new `X`. 
4. Print the prediction

In [None]:
my_sentence = "Dick Metcalf has been shooting since kindergarten, a member of the NRA since middle school He’s been studying, writing, and teaching about firearms for over 40 years. But Metcalf’s long career as a columnist with Guns & Ammo magazine came to an abrupt halt in late 2013 after he penned a column that explored the line between firearm regulation and Second Amendment infringement. The backlash was quick and complete from what Metcalf calls the “radical extremists” of the firearms movement, a movement he’s always considered himself a part of. Join Dick Metcalf and The Atlantic’s Ron Brownstein for a conversation about guns, culture, and the future of firearms in America."

# 1. Apply the text pipeline to your sentence
my_sentence = text_pipeline(my_sentence)
# 2. Transform your cleaned text using the `X = vectorizer.transform([your_text])`
new_X = vectorizer.transform([my_sentence])
# 3. Use the model to predict your new `X`. 
prediction = model.predict(new_X)
# 4. Print the prediction
print(prediction)
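
Beyond the single predicted label, `predict_proba` shows how confident the model is across all classes. The sketch below trains a tiny stand-in model on invented text; the `toy_` names and the two-topic setup are assumptions made only so the example runs on its own.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training set, invented for illustration.
toy_texts = ["gun rights firearms", "rocket orbit nasa",
             "firearms regulation gun", "nasa moon launch"]
toy_topics = ["talk.politics.guns", "sci.space",
              "talk.politics.guns", "sci.space"]

toy_vec = TfidfVectorizer().fit(toy_texts)
toy_model = MultinomialNB().fit(toy_vec.transform(toy_texts), toy_topics)

# transform() expects an iterable of documents, hence the list.
new_X = toy_vec.transform(["a column about gun regulation"])
probabilities = toy_model.predict_proba(new_X)[0]

# Pair each class with its probability, highest first.
for idx in np.argsort(probabilities)[::-1]:
    print(toy_model.classes_[idx], round(probabilities[idx], 3))
```

With the real 20-class model, slicing the sorted indices (for example the first three) gives a quick view of which topics the model confuses with your sentence.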


___
# PART 2: Twitter Data
This part of the exercise is un-guided on purpose.  

Using the `dem-vs-rep-tweets.csv` build a classifier to determine if a tweet was written by a democrat or republican. 

Can you get an F1 score higher than 82%?

In [62]:
# Load the dem-vs-rep-tweets.csv data into a dataframe.
df = pd.read_csv('data/dem-vs-rep-tweets.csv')

In [63]:
df.shape

(86460, 3)

In [64]:
df.head()

|   | Party | Handle | Tweet |
|---|---|---|---|
| 0 | Democrat | RepDarrenSoto | Today, Senate Dems vote to #SaveTheInternet. P... |
| 1 | Democrat | RepDarrenSoto | RT @WinterHavenSun: Winter Haven resident / Al... |
| 2 | Democrat | RepDarrenSoto | RT @NBCLatino: .@RepDarrenSoto noted that Hurr... |
| 3 | Democrat | RepDarrenSoto | RT @NALCABPolicy: Meeting with @RepDarrenSoto ... |
| 4 | Democrat | RepDarrenSoto | RT @Vegalteno: Hurricane season starts on June... |


In [65]:
df.isnull().sum()

Party     0
Handle    0
Tweet     0
dtype: int64

In [66]:
df.duplicated().sum()

57

In [67]:
df = df.drop_duplicates()

In [68]:
df.duplicated().sum()

0

In [69]:
df['Party'].value_counts()

Republican    44362
Democrat      42041
Name: Party, dtype: int64

In [70]:
df['cleaned_tweets'] = df['Tweet'].apply(text_pipeline)

In [71]:
print("ORIGINAL TEXT:", df['Tweet'][0])
print("CLEANED TEXT:", df['cleaned_tweets'][0])

ORIGINAL TEXT: Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House… https://t.co/n3tggDLU1L
CLEANED TEXT: today senate dems vote savetheinternet proud support similar netneutrality legislation house httpstcon3tggdlu1l
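
The cleaned text above ends with `httpstcon3tggdlu1l`, the remains of a shortened URL after punctuation removal. One option is to strip URLs before the rest of the pipeline runs. The helper name `strip_tweet_noise` and the choice to also drop the leading `RT @user:` marker are assumptions; retweet handles may carry useful signal, so that part is worth testing both ways.

```python
import re

def strip_tweet_noise(tweet):
    # Remove URLs first, while the "://" and "." characters are still present.
    tweet = re.sub(r'https?://\S+', '', tweet)
    # Drop a leading retweet marker, e.g. "RT @user:".
    tweet = re.sub(r'^RT\s+@\w+:\s*', '', tweet)
    # Collapse any leftover runs of whitespace.
    return re.sub(r'\s+', ' ', tweet).strip()

# Made-up example assembled from fragments of the tweets shown above.
example = "RT @NBCLatino: Proud to support #NetNeutrality https://t.co/n3tggDLU1L"
print(strip_tweet_noise(example))  # Proud to support #NetNeutrality
```

Calling this as the first step of the tweet pipeline keeps unique URL fragments from inflating the vocabulary.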


In [72]:
df = pd.get_dummies(df, columns=['Party'], drop_first=True)

In [73]:
df['Party_Republican'].value_counts()

1    44362
0    42041
Name: Party_Republican, dtype: int64

In [74]:
X = df['cleaned_tweets'].values
y = df['Party_Republican'].values

In [75]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

In [76]:
# Initialize our vectorizer
vectorizer = TfidfVectorizer()

# This learns the vocabulary (and IDF weights) from the training text
vectorizer.fit(X_train)

# This transforms your documents into vectors.
X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)

print(X_train.shape, type(X))

(69122, 112403) <class 'numpy.ndarray'>


In [165]:
model = MultinomialNB(alpha = 0.455)

In [166]:
model.fit(X_train, y_train)

In [167]:
y_pred = model.predict(X_test)

In [168]:
accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)
print("Accuracy Score: %f" % accuracy)

precision = precision_score(y_true=y_test, y_pred=y_pred)
print("Precision Score: %f" % precision)

f1 = f1_score(y_true=y_test, y_pred=y_pred)
print('F1 Score: %f' % f1)

recall = recall_score(y_true=y_test, y_pred=y_pred)
print("Recall Score: %f" % recall)

Accuracy Score: 0.811469
Precision Score: 0.797269
F1 Score: 0.822162
Recall Score: 0.848659


In [26]:
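# y is now 0/1 (Party_Republican), so name the classes explicitly: 0 = Democrat, 1 = Republican
print(classification_report(y_test, y_pred, target_names=['Democrat', 'Republican']))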

              precision    recall  f1-score   support

    Democrat       0.82      0.79      0.80      8407
  Republican       0.81      0.83      0.82      8874

    accuracy                           0.81     17281
   macro avg       0.81      0.81      0.81     17281
weighted avg       0.81      0.81      0.81     17281



In [113]:
# alpha must be numeric; 'auto' is not a valid value and makes every fit that uses it fail.
params = {
    'alpha' : [0.001, 0.01, 0.05, 0.1, 1]
         }
grid_search_cv = GridSearchCV(
    estimator=MultinomialNB(),
    param_grid = params, 
    scoring = 'recall')

grid_search_cv.fit(X_train, y_train)

print(grid_search_cv.best_params_)

model = grid_search_cv.best_estimator_

{'alpha': 1}

In [114]:
y_pred = model.predict(X_test)

In [115]:
f1 = f1_score(y_true=y_test, y_pred=y_pred)
print('F1 Score: %f' % f1)

F1 Score: 0.818819


In [116]:
recall = recall_score(y_true=y_test, y_pred=y_pred)
print("Recall Score: %f" % recall)

Recall Score: 0.855082
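
The search above optimizes recall, while the stated goal is the F1 score. One possible next step is to search with `scoring='f1'` and to put the vectorizer inside a `Pipeline`, so its settings are tuned too and it is re-fit on each training fold. The toy tweets, labels, and parameter values below are invented for illustration; `cv=2` is used only because the toy set is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy tweets and labels (1 = Republican, 0 = Democrat), invented for illustration.
toy_tweets = [
    "tax cuts jobs", "border security wall", "tax cuts economy", "border wall jobs",
    "health care access", "climate action now", "health care climate", "climate access now",
]
toy_party = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Step-name prefixes ("tfidf__", "nb__") route each parameter to its step.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.01, 0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid=param_grid, scoring="f1", cv=2)
search.fit(toy_tweets, toy_party)

print(search.best_params_)
print(search.best_score_)
```

On the real data you would pass the cleaned tweet text (not the already vectorized matrix) to `search.fit`, since the pipeline does the vectorizing itself.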
