<a href="https://colab.research.google.com/github/taylorvroman09/Taylor-Public/blob/main/Project_2_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project \#2 Starter Code
Your project should address the categories below. 

## Problem:



State the problem you are trying to solve with this machine learning experiment. Include a description of the data, and what you're trying to predict. What are the possible uses for this kind of machine learning model?

We are trying to determine the sentiment of the movie reviews based on the words in the review itself. The data contains 50,000 reviews with one column, "review," containing the review itself and another column, "sentiment," that describes the review as "positive" or "negative."

# Input Pipeline (sklearn):

In [3]:
from google.colab import drive
import pandas
drive.mount('/content/drive')
data = pandas.read_csv('/content/drive/MyDrive/MachineLearning/IMDB_dataset.csv')
data.head()

Mounted at /content/drive


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


## Data Exploration:
- Number of samples
- Number of classes of the target variable
- Number of words per sample
- Distribution of sample length
- Something else: get creative :) 

In [4]:
## Use cells here to explore the data:
##number of samples
print("Number of samples: ", len(data))
print("Number of classes of the target variable: ", len(data.sentiment.unique()))
dm = data['review'].str.split().apply(len).value_counts()
print("Number of words per sample: ", dm.median())



Number of samples:  50000
Number of classes of the target variable:  2
Number of words per sample:  15.0


## Data Preparation

I'm providing you with code that cleans the reviews by making it all lowercase letters and removing stop words. The three cells below do this for you. I still want you to explain what you did with the data here. 

word embeddings, stop words, vectorization, tokeniztion,

In [5]:
from bs4 import BeautifulSoup
import re
import nltk
#only do next line once
nltk.download() #in Corpora tab, download stopwords
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
#The NLTK downloader will open, you need to select (d) for Download, and then 'stopwords'then (q) to quit

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> stopwords
    Downloading package stopwords to /root/nltk_data...
      Unzipping corpora/stopwords.zip.

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> q


In [6]:
#This is a function that takes in a review, makes sure it is only lower case letters and removes stopwords.
#It returns the cleaned review text.
def clean_review(review):
    #input is a string review
    #return is review cleaned of all punctuation, lowercase, and removed nltk stopwords
    letters_only = re.sub("[^a-zA-Z]"," ",review)
    lower_case = letters_only.lower()
    words = lower_case.split()
    for stop_word in stopwords.words("english"):
        while stop_word in words:
            words.remove(stop_word)
    cleaned = " ".join(words)
    return cleaned

In [7]:
#process the data
cleaned_text = []
for i in range(len(data)):
    cleaned_text.append(clean_review(data["review"][i]))  

In [8]:
cleaned_text[:5]

['one reviewers mentioned watching oz episode hooked right exactly happened br br first thing struck oz brutality unflinching scenes violence set right word go trust show faint hearted timid show pulls punches regards drugs sex violence hardcore classic use word br br called oz nickname given oswald maximum security state penitentary focuses mainly emerald city experimental section prison cells glass fronts face inwards privacy high agenda em city home many aryans muslims gangstas latinos christians italians irish scuffles death stares dodgy dealings shady agreements never far away br br would say main appeal show due fact goes shows dare forget pretty pictures painted mainstream audiences forget charm forget romance oz mess around first episode ever saw struck nasty surreal say ready watched developed taste oz got accustomed high levels graphic violence violence injustice crooked guards sold nickel inmates kill order get away well mannered middle class inmates turned prison bitches du

In [9]:
#establish training and testing dataset
train_data, test_data, train_sln, test_sln = \
    train_test_split(cleaned_text, data['sentiment'], test_size = 0.2, random_state=0) 

### Vectorizing the data

**CountVectorizer**: Convert a collection of text documents to a matrix of token counts

In [10]:
from sklearn.feature_extraction.text import CountVectorizer 

#Bag of Words with 5000 most common words
vectorizer = CountVectorizer(analyzer='word', max_features = 50)
#find the right 5000 words
vectorizer.fit(train_data)

#use the vectorizer to transform review strings into word count vectors 
train_data_vectors = vectorizer.transform(train_data).toarray()
test_data_vectors = vectorizer.transform(test_data).toarray()

Support Vector Classifer
Principal Component Analysis
Perceptron
MLP

*Metrics*

- What metrics will you use to evaluate your model? Why are these metrics the best for your model? (Hint, this should be more than 'accuracy').

- I will use a Support Vector Classifier, Principal Component Analysis, a Perceptron, and an MLP to determine which approach most effectively evaluates my model. By using the accuracy of *each* 

In [12]:
## Now use train_data_vectors and test_data_vectors to train/test/tune your sklearn models.
##SVC
from sklearn.svm import SVC
from sklearn import metrics
clf = SVC(kernel = 'linear', degree = 5)
clf.fit(train_data_vectors,train_sln)
predictions = clf.predict(test_data_vectors)
print("accuracy:", metrics.accuracy_score(test_sln, predictions))

accuracy: 0.7047


In [18]:
##look up different things to tune
from sklearn.linear_model import Perceptron
perc = Perceptron(penalty = 'elasticnet')
perc.fit(train_data_vectors,train_sln)
p_predictions = perc.predict(test_data_vectors)

#output accuracy
print("Accuracy:", metrics.accuracy_score(test_sln, p_predictions))

Accuracy: 0.662


In [34]:
from sklearn.decomposition import PCA
extractor = PCA(n_components=2, whiten=True)
extractor.fit(train_data_vectors)
print('this is the variance/importance of each component')
print(extractor.explained_variance_ratio_)

this is the variance/importance of each component
[0.46335543 0.09410852]


In [37]:
from sklearn.neural_network import MLPClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix

mlp = MLPClassifier(random_state=0,hidden_layer_sizes = (100,), max_iter = 800)
mlp.fit(train_data_vectors,train_sln)
predictions = mlp.predict(test_data_vectors)

print("Accuracy: ", metrics.accuracy_score(test_sln,predictions))



Accuracy:  0.6715
