# Text Classification using Scikit-Learn

### 1. Load the dataset (with sentiment)

In [10]:
import pandas as pd

In [11]:
# Show the full text of each cell instead of truncating long strings
pd.set_option('display.max_colwidth', None)

In [12]:
# Load the movie data, which already includes a sentiment score per row
df = pd.read_pickle('movie_reviews_with_sentiment.pkl')
df.head(2)

,movie_title,rating,genre,in_theaters_date,movie_info,directors,director_gender,tomatometer_rating,audience_rating,critics_consensus,sentiment
0,A Dog's Journey,PG,"Drama, Kids & Family",5/17/19,"Bailey (voiced again by Josh Gad) is living the good life on the Michigan farm of his ""boy,"" Ethan (Dennis Quaid) and Ethan's wife Hannah (Marg Helgenberger). He even has a new playmate: Ethan and Hannah's baby granddaughter, CJ. The problem is that CJ's mom, Gloria (Betty Gilpin), decides to take CJ away. As Bailey's soul prepares to leave this life for a new one, he makes a promise to Ethan to find CJ and protect her at any cost. Thus begins Bailey's adventure through multiple lives filled with love, friendship and devotion as he, CJ (Kathryn Prescott), and CJ's best friend Trent (Henry Lau) experience joy and heartbreak, music and laughter, and few really good belly rubs.",Gail Mancuso,female,50,92,"A Dog's Journey is as sentimental as one might expect, but even cynical viewers may find their ability to resist shedding a tear stretched to the puppermost limit.",0.9837
1,A Dog's Way Home,PG,Drama,1/11/19,"Separated from her owner, a dog sets off on an 400-mile journey to get back to the safety and security of the place she calls home. Along the way, she meets a series of new friends and manages to bring a little bit of comfort and joy to their lives.",Charles Martin Smith,male,60,71,"A Dog's Way Home may not quite be a family-friendly animal drama fan's best friend, but this canine adventure is no less heartwarming for its familiarity.",0.9237


### 2. Clean & Normalize the data

In [13]:
import clean_and_normalize_text as cn

In [17]:
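# Add a cleaned, normalized version of each movie description
df['movie_info_clean'] = cn.clean_and_normalize(df.movie_info)

# Keep only the columns needed from here on; .copy() makes temp_df independent of df
temp_df = df[['movie_title', 'movie_info', 'director_gender', 'sentiment', 'movie_info_clean']].copy()
temp_df.head(2)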

,movie_title,movie_info,director_gender,sentiment,movie_info_clean
0,A Dog's Journey,"Bailey (voiced again by Josh Gad) is living the good life on the Michigan farm of his ""boy,"" Ethan (Dennis Quaid) and Ethan's wife Hannah (Marg Helgenberger). He even has a new playmate: Ethan and Hannah's baby granddaughter, CJ. The problem is that CJ's mom, Gloria (Betty Gilpin), decides to take CJ away. As Bailey's soul prepares to leave this life for a new one, he makes a promise to Ethan to find CJ and protect her at any cost. Thus begins Bailey's adventure through multiple lives filled with love, friendship and devotion as he, CJ (Kathryn Prescott), and CJ's best friend Trent (Henry Lau) experience joy and heartbreak, music and laughter, and few really good belly rubs.",female,0.9837,bailey voice josh gad live good life michigan farm boy ethan dennis quaid ethans wife hannah marg helgenberger new playmate ethan hannah baby granddaughter cj problem cjs mom gloria betty gilpin decide cj away bailey soul prepare leave life new make promise ethan find cj protect cost begin bailey adventure multiple life fill love friendship devotion cj kathryn prescott cjs good friend trent henry lau experience joy heartbreak music laughter good belly rub
1,A Dog's Way Home,"Separated from her owner, a dog sets off on an 400-mile journey to get back to the safety and security of the place she calls home. Along the way, she meets a series of new friends and manages to bring a little bit of comfort and joy to their lives.",male,0.9237,separate owner dog set 400mile journey safety security place call home way meet series new friend manage bring little bit comfort joy life


### 3. Vectorize the data

In [29]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english', min_df=.1)

In [30]:
X = cv.fit_transform(temp_df.movie_info_clean)
X

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 572 stored elements and shape (166, 22)>
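The summary above reports 572 stored (non-zero) values in a 166 × 22 matrix. A quick way to read that figure is as a density: the share of cells that actually hold a count. The helper below is a minimal sketch; `density` and the `toy` matrix are hypothetical names introduced here, and the last lines simply repeat the arithmetic with the numbers from the output.

```python
import numpy as np
from scipy.sparse import csr_matrix


def density(matrix):
    """Return the fraction of cells that hold a stored (non-zero) value."""
    n_rows, n_cols = matrix.shape
    return matrix.nnz / (n_rows * n_cols)


# Hypothetical 2 x 3 count matrix, used only to exercise the helper
toy = csr_matrix(np.array([[1, 0, 2], [0, 0, 3]]))
print(density(toy))  # 3 stored values out of 6 cells

# Same arithmetic with the figures reported for X: 572 stored, shape (166, 22)
reported_density = 572 / (166 * 22)
print(f"{reported_density:.1%}")
```

In the notebook itself, `density(X)` would give the same number directly from the fitted matrix.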

In [31]:
# Convert the sparse counts to a dense DataFrame with one column per vocabulary term
X_cv_df = pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())
X_cv_df

,begin,discover,family,film,follow,force,friend,home,leave,life,...,man,new,set,star,story,turn,woman,world,year,young
0,1,0,0,0,0,0,1,0,1,3,...,0,2,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,1,1,0,1,...,0,1,1,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,2,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,0,0,0,0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
161,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0
162,0,0,0,1,1,0,0,0,0,0,...,0,0,0,0,1,0,1,0,0,0
163,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,1,0,1,0,0,1
164,1,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,1,0,0


In [32]:
# Target labels: the director's gender for each movie
Y_cv = temp_df.director_gender
Y_cv

0      female
1        male
2        male
3      female
4      female
        ...  
161      male
162      male
163      male
164    female
165      male
Name: director_gender, Length: 166, dtype: object
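Before fitting a classifier on labels like these, it can help to check how balanced the two classes are. The sketch below uses a short hypothetical `labels` Series in place of `Y_cv`; the values are invented and do not reflect the real class counts.

```python
import pandas as pd

# Hypothetical labels standing in for Y_cv; not the real distribution
labels = pd.Series(
    ["female", "male", "male", "female", "male"],
    name="director_gender",
)

# Absolute count per class
counts = labels.value_counts()

# Share of each class, which is easier to compare across datasets
shares = labels.value_counts(normalize=True)

print(counts)
print(shares)
```

Running `Y_cv.value_counts(normalize=True)` in the notebook would show the actual split.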