# Introduction
In Part II, we explored the text data with word clouds to build intuition. In this part, we first transform the text into a form suitable for machine learning, then train a classifier on it.

As an overview, we will:
1. Load our libraries
2. Read our text data
3. Clean our data even further
4. Perform tf-idf vectorization
5. Train a machine learning model
6. Tune the number of features and retrain

### Step 1: Import libraries
Import the following libraries:
1. pandas as pd
2. STOPWORDS from wordcloud


In [1]:
# Step 1: Import the libraries
import pandas as pd
from wordcloud import STOPWORDS

### Step 2: Read your CSV from Part I
Import the CSV that we exported at the end of Part I. 
Make sure that the resulting DataFrame has 2,286 rows and 3 columns.
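
One way to confirm the shape is to inspect `df.shape` right after loading. The sketch below uses a tiny in-memory CSV with hypothetical rows standing in for `CleanedMisogyny.csv`, so it runs on its own; the name `sample_df` is also just for illustration. With the real file, the shape to look for would be `(2286, 3)`.

```python
import io

import pandas as pd

# Hypothetical stand-in for CleanedMisogyny.csv: an unnamed index column
# followed by the three columns we want to keep.
csv_text = (
    ",Definition,is_misogyny,cleaned_definition\n"
    "0,Short for brother.,0.0,short for brother\n"
    "1,Example definition.,1.0,example definition\n"
)

wanted = ['Definition', 'is_misogyny', 'cleaned_definition']

# usecols drops the unnamed index column, leaving only the three named ones
sample_df = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

# shape is a (rows, columns) tuple
print(sample_df.shape)
```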

In [2]:
# Step 2: Read the CSV
df = pd.read_csv('CleanedMisogyny.csv', usecols = ['Definition', 'is_misogyny', 'cleaned_definition'])
df

Unnamed: 0,Definition,is_misogyny,cleaned_definition
0,Ur gonna die... queer,0.0,ur gonna die queer
1,Valuptuous man boobs.,0.0,valuptuous man boobs
2,Variation of brother.,0.0,variation of brother
3,Very impressive penis,0.0,very impressive penis
4,What I call my penis.,0.0,what i call my penis
...,...,...,...
2281,"A women who is ""easy""",1.0,a women who is easy
2282,Any hot/ sexy chicks.,1.0,any hot sexy chicks
2283,Any vaginal secretion,1.0,any vaginal secretion
2284,Person who slaps hoes,1.0,person who slaps hoes


## Data Preparation
### Step 3: Create a new column without stopwords
Using STOPWORDS from the WordCloud library, we will remove stopwords from the text in the 'cleaned_definition' column
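
To see what the regular expression in the next cell does, here is a minimal sketch using only the standard library. The stopword set and the second example string are hypothetical and much smaller than `STOPWORDS`; the helper name `remove_stopwords` is also an assumption made for illustration.

```python
import re

# Hypothetical, tiny stopword set; the notebook uses wordcloud.STOPWORDS instead.
stopwords = {"of", "i", "my", "what", "very", "is", "a", "who"}

# \b anchors each alternative at word boundaries, so the "a" inside "band" is untouched.
# re.escape guards against stopwords that contain regex metacharacters.
pattern = re.compile(r"\b(?:{})\b".format("|".join(map(re.escape, sorted(stopwords)))))

def remove_stopwords(text):
    """Delete stopwords, then collapse the leftover runs of whitespace."""
    without_stopwords = pattern.sub("", text)
    return re.sub(r"\s+", " ", without_stopwords).strip()

examples = ["variation of brother", "what i call my band"]
cleaned = [remove_stopwords(text) for text in examples]
print(cleaned)
```

The second substitution matters because deleting a word leaves its surrounding spaces behind.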


In [3]:
# Step 3: Remove stopwords from the text in 'cleaned_definition'
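pat = r'\b(?:{})\b'.format('|'.join(STOPWORDS))
# regex=True is needed in recent pandas versions, where str.replace treats the pattern as a literal string by default
df['cleaned_definition_nostop'] = df['cleaned_definition'].str.replace(pat, '', regex=True)
df['cleaned_definition_nostop'] = df['cleaned_definition_nostop'].str.replace(r'\s+', ' ', regex=True)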
df



Unnamed: 0,Definition,is_misogyny,cleaned_definition,cleaned_definition_nostop
0,Ur gonna die... queer,0.0,ur gonna die queer,ur gonna die queer
1,Valuptuous man boobs.,0.0,valuptuous man boobs,valuptuous man boobs
2,Variation of brother.,0.0,variation of brother,variation brother
3,Very impressive penis,0.0,very impressive penis,impressive penis
4,What I call my penis.,0.0,what i call my penis,call penis
...,...,...,...,...
2281,"A women who is ""easy""",1.0,a women who is easy,women easy
2282,Any hot/ sexy chicks.,1.0,any hot sexy chicks,hot sexy chicks
2283,Any vaginal secretion,1.0,any vaginal secretion,vaginal secretion
2284,Person who slaps hoes,1.0,person who slaps hoes,person slaps hoes


### Step 4: Import TfidfVectorizer
To prepare our data for machine learning, we will turn each piece of text into a numeric vector of tf-idf scores.

Term frequency-inverse document frequency (tf-idf) is a score that highlights the more informative words, i.e. words that occur often in one document but in few documents overall.
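
To make the score concrete, here is a small hand-rolled sketch. It assumes the smoothed variant idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each row, which is my understanding of scikit-learn's default settings; the toy documents and the function name `tfidf_rows` are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (assumed formula, see lead-in)
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        raw = {term: term_freq[term] * idf[term] for term in term_freq}
        norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
        rows.append({term: weight / norm for term, weight in raw.items()})
    return rows

toy_docs = ["red apple", "red car", "red fast car"]
weights = tfidf_rows(toy_docs)
print(weights[0])
```

In the first document, "apple" appears in only one document while "red" appears in all three, so "apple" ends up with the larger weight.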


In [4]:
# Step 4: Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

### Step 5: Vectorize your 'cleaned_definition_nostop'
Time to vectorize the 'cleaned_definition_nostop' column and get each word's tf-idf score.

It works like this:
1. Assign a TfidfVectorizer object to a variable
2. Call the object's .fit_transform method on the column values

The result is a sparse matrix (not a DataFrame yet). 
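
As a quick illustration of those two steps on something small enough to inspect, the sketch below fits a vectorizer on a hypothetical three-document corpus. The names `toy_docs`, `toy_vectorizer` and `toy_matrix` are made up for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not the notebook's data
toy_docs = ["red apple", "red car", "red fast car"]

# 1. Create the vectorizer object
toy_vectorizer = TfidfVectorizer()

# 2. Learn the vocabulary and compute the tf-idf matrix in one call
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# One row per document, one column per distinct token
print(toy_matrix.shape)

# vocabulary_ maps each token to its column index
print(sorted(toy_vectorizer.vocabulary_))

# Only the non-zero entries are stored
print(toy_matrix.nnz)
```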

In [5]:
# Step 5a: Declare your TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Step 5b: Perform a fit_transform method with the 'cleaned_definition_nostop' column values
vectorised_def = vectorizer.fit_transform(df['cleaned_definition_nostop'])
vectorised_def

<2286x16269 sparse matrix of type '<class 'numpy.float64'>'
	with 53506 stored elements in Compressed Sparse Row format>

### Step 6: Turn the sparse matrix into a DataFrame
The vectorizer returns a sparse matrix because every distinct token becomes its own column, and most documents contain only a handful of those tokens. Since the resulting matrix can be huge and is mostly zeros, only the non-zero values are stored, which keeps memory use manageable.
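
A rough back-of-the-envelope calculation shows how much this saves for the matrix reported in Step 5 (2,286 rows, 16,269 columns, 53,506 stored elements). The sparse estimate assumes 64-bit values with 32-bit integer indices, which is an assumption about the storage layout rather than something the output states.

```python
# Figures taken from the Step 5 output
n_rows, n_cols, n_stored = 2286, 16269, 53506

total_cells = n_rows * n_cols
density = n_stored / total_cells

# A dense float64 array needs 8 bytes for every cell
dense_bytes = total_cells * 8

# Assumed CSR layout: 8-byte values, 4-byte column indices,
# plus one 4-byte row pointer per row (and one extra)
sparse_bytes = n_stored * 8 + n_stored * 4 + (n_rows + 1) * 4

print(f"non-zero share: {density:.4%}")
print(f"dense estimate:  {dense_bytes / 1e6:.1f} MB")
print(f"sparse estimate: {sparse_bytes / 1e6:.2f} MB")
```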

In [6]:
# Step 6: Convert the sparse matrix into a DataFrame
df_sparse = pd.DataFrame.sparse.from_spmatrix(vectorised_def, columns = vectorizer.get_feature_names_out())
df_sparse



Unnamed: 0,010,02,034,04,0702pm,0708pm,07rnrntheir,095lbsrnbirthday,0chan,10,...,ûïsight,ûïsuspicious,ûïthe,ûïtrippin,ûïworld,ûò,ûònounnn1,ûòverb,ûòverbnn1rnto,ûóhe
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2281,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2282,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2283,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2284,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Step 7: Import machine learning libraries
Import the following machine learning libraries:
1. train_test_split from sklearn.model_selection
2. DummyClassifier from sklearn.dummy
3. LogisticRegression from sklearn.linear_model
4. DecisionTreeClassifier from sklearn.tree
5. RandomForestClassifier from sklearn.ensemble
6. f1_score from sklearn.metrics
7. confusion_matrix from sklearn.metrics

In [7]:
# Step 7: Import your ML libraries
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix

### Step 8: Prepare the independent and dependent variables
Now that everything is in place, let's prepare our independent variables (the tf-idf DataFrame) and the dependent variable ('is_misogyny').

In [8]:
# Step 8: Prepare your independent and dependent variables
independent_variables = df_sparse
dependent_variable = df['is_misogyny'].to_list()

### Step 9: Split your independent and dependent variables into train and test sets
Because we do not pass test_size, the train_test_split function uses its default 75/25 split between the train and test sets, stratified by the dependent variable. Pass test_size = 0.2 if you prefer an 80/20 split.
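
A minimal sketch of what stratification does, on hypothetical toy data (100 rows, 60 of class 0 and 40 of class 1). Like the cell below, it leaves `test_size` unset, in which case `train_test_split` holds out 25% of the rows; `random_state=0` is added here only to make the sketch repeatable.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: one feature per row, 60/40 class balance
toy_X = [[i] for i in range(100)]
toy_y = [0] * 60 + [1] * 40

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, stratify=toy_y, random_state=0
)

# Sizes of the two sets
print(len(X_tr), len(X_te))

# Class counts in each set keep the original 60/40 ratio
print(Counter(y_tr), Counter(y_te))
```

Without `stratify`, the class ratio in the test set could drift away from the ratio in the full dataset, which makes scores harder to compare.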

In [9]:
# Step 9: Split your data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(independent_variables, dependent_variable, stratify = dependent_variable)

### Step 10: Train a DummyClassifier model
This is what we'll need to do:
1. Start with a model
2. Declare a variable, and store the model in it
3. Fit the instantiated model on the training data
4. Declare a variable that contains predictions from the model we just trained, using the test dataset (X_test)
5. Print the f1_score between the actual y values and the predictions
6. Print the confusion matrix between the two values

We will start with DummyClassifier to establish the baseline.

In [10]:
# Step 10a: Declare a variable to store the DummyClassifier model
dummy_cls = DummyClassifier()

# Step 10b: Fit your train dataset
dummy_cls.fit(X_train, y_train)

# Step 10c: Declare a variable and store your predictions that you make with your model using X test data
predicted = dummy_cls.predict(X_test)

# Step 10d: Print the f1_score between the y test and prediction
F1 = f1_score(y_test, predicted)

# Step 10e: Print the confusion matrix using the y test and prediction
cfn_matrix = confusion_matrix(y_test, predicted)

In [11]:
model_type = "Dummy Classifier"
print(model_type)
print('=' * len(model_type))
print("F1 Score: {}".format(F1))
print('')
print('Confusion Matrix')
print(cfn_matrix)

Dummy Classifier
F1 Score: 0.0

Confusion Matrix
[[313   0]
 [259   0]]
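
The confusion matrix shows the baseline predicting class 0 for every test row, which is what a majority-class baseline does, and explains the F1 score of 0.0: no positive is ever predicted. The sketch below reproduces that behaviour on hypothetical toy data. Setting `strategy="most_frequent"` explicitly is an assumption made for clarity; the notebook relies on the library default, which may differ between scikit-learn versions.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Hypothetical toy data: the features are ignored by the dummy model anyway
toy_X = [[0]] * 10
toy_y = [0] * 6 + [1] * 4

# Always predict the most common class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_X, toy_y)
baseline_pred = baseline.predict(toy_X)

# zero_division=0 reports 0.0 without a warning when no positives are predicted
baseline_f1 = f1_score(toy_y, baseline_pred, zero_division=0)

print(list(baseline_pred))
print(baseline_f1)
```

Any real model should beat this score; if it does not, the features carry no usable signal.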


### Step 11: Train a LogisticRegression model
Train a LogisticRegression model and assess its performance.

In [12]:
# Step 11a: Declare a variable to store the LogisticRegression model
logreg = LogisticRegression()

# Step 11b: Fit your train dataset
logreg.fit(X_train, y_train)

# Step 11c: Declare a variable and store your predictions that you make with your model using X test data
predicted = logreg.predict(X_test)

# Step 11d: Print the f1_score between the y test and prediction
F1 = f1_score(y_test, predicted)

# Step 11e: Print the confusion matrix using the y test and prediction
cfn_matrix = confusion_matrix(y_test, predicted)


In [13]:
model_type = "Logistic Regression"
print(model_type)
print('=' * len(model_type))
print("F1 Score: {}".format(F1))
print('')
print('Confusion Matrix')
print(cfn_matrix)

Logistic Regression
F1 Score: 0.8044444444444444

Confusion Matrix
[[303  10]
 [ 78 181]]
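
The F1 score can be recovered from the confusion matrix by hand, which is a useful sanity check. In scikit-learn's layout, rows are the true classes and columns the predicted ones, so the matrix reads `[[TN, FP], [FN, TP]]`. The helper name `f1_from_confusion` below is hypothetical.

```python
def f1_from_confusion(matrix):
    """Compute F1 for the positive class from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Matrices copied from the Dummy Classifier and Logistic Regression outputs
dummy_matrix = [[313, 0], [259, 0]]
logreg_matrix = [[303, 10], [78, 181]]

print(f1_from_confusion(dummy_matrix))
print(f1_from_confusion(logreg_matrix))
```

For logistic regression, precision is 181/191 and recall is 181/259, so the model is rarely wrong when it flags a definition but misses a fair number of positives.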


### Step 12: Train a DecisionTreeClassifier model
Let's see if using DecisionTree classifier changes things.

In [14]:
# Step 12a: Train a DecisionTreeClassifier model
dtree = DecisionTreeClassifier()

# Step 12b: Fit your train dataset
dtree.fit(X_train, y_train)

# Step 12c: Declare a variable and store your predictions that you make with your model using X test data
predicted = dtree.predict(X_test)

# Step 12d: Print the f1_score between the y test and prediction
F1 = f1_score(y_test, predicted)

# Step 12e: Print the confusion matrix using the y test and prediction
cfn_matrix = confusion_matrix(y_test, predicted)

In [15]:
model_type = "Decision Tree Classifier"
print(model_type)
print('=' * len(model_type))
print("F1 Score: {}".format(F1))
print('')
print('Confusion Matrix')
print(cfn_matrix)

Decision Tree Classifier
F1 Score: 0.8260038240917782

Confusion Matrix
[[265  48]
 [ 43 216]]


### Step 13: Train a RandomForestClassifier model

In [16]:
# Step 13a: Train a RandomForestClassifier model
rforest = RandomForestClassifier()

# Step 13b: Fit your train dataset
rforest.fit(X_train, y_train)

# Step 13c: Declare a variable and store your predictions that you make with your model using X test data
predicted = rforest.predict(X_test)

# Step 13d: Print the f1_score between the y test and prediction
F1 = f1_score(y_test, predicted)

# Step 13e: Print the confusion matrix using the y test and prediction
cfn_matrix = confusion_matrix(y_test, predicted)

In [17]:
model_type = "Random Forest Classifier"
print(model_type)
print('=' * len(model_type))
print("F1 Score: {}".format(F1))
print('')
print('Confusion Matrix')
print(cfn_matrix)

Random Forest Classifier
F1 Score: 0.8607068607068606

Confusion Matrix
[[298  15]
 [ 52 207]]


### Step 14: Get feature importance from model and create a DataFrame
The RandomForest model yielded the best performance based on the F1 scores. Let's look at which features matter most for its classifications.
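
To get a feel for what `feature_importances_` reports, here is a sketch on hypothetical synthetic data where one column determines the label and two columns are pure noise. Setting `max_features=None` (so every split considers all columns) and the fixed seeds are assumptions made to keep the toy result stable; the notebook uses the defaults.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical data: column 0 equals the label, columns 1 and 2 are noise
toy_y = np.array([0, 1] * 100)
toy_X = np.column_stack([toy_y, rng.random(200), rng.random(200)])

toy_forest = RandomForestClassifier(
    n_estimators=20, max_features=None, random_state=0
)
toy_forest.fit(toy_X, toy_y)

# One importance per column; the values sum to 1
toy_importances = toy_forest.feature_importances_

# Column indices ordered from most to least important
ranking = np.argsort(toy_importances)[::-1]

print(toy_importances)
print(ranking)
```

The importances are relative shares, so a value only says how much a feature contributed compared with the others in the same model.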

In [18]:
# Step 14: Create a DataFrame containing the feature importances of the model and sort 

# get the feature importances of the random forest model
importances = rforest.feature_importances_

# create DataFrame and sort
rf_df = pd.DataFrame(data = {'feature': independent_variables.columns, 'importance': importances})
rf_df.sort_values('importance', ascending = False)

Unnamed: 0,feature,importance
15317,vagina,0.049274
5289,female,0.045197
11318,pussy,0.039797
15937,woman,0.023079
10400,penis,0.016607
...,...,...
4944,executive,0.000000
4945,executivesrnrnoften,0.000000
10442,perfection,0.000000
4946,exemplified,0.000000


### Step 15: Repeat Steps 5-6 with max_features
Turns out we don't really need so many features from vectorization. 

Let's repeat our vectorization but <strong>limit the number of features extracted to 100</strong> by adding the 'max_features' parameter.

Limiting the features shortens training time and can also help reduce overfitting.
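
As far as I understand the parameter, `max_features` keeps the terms with the highest total frequency across the corpus and drops the rest. A minimal sketch on a hypothetical toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "red" occurs 3 times, "car" twice, the others once
toy_docs = ["red apple", "red car", "red fast car"]

# Without a cap, every distinct token becomes a feature
full_vocab = sorted(TfidfVectorizer().fit(toy_docs).vocabulary_)

# With max_features=2, only the two most frequent tokens are kept
capped_vocab = sorted(TfidfVectorizer(max_features=2).fit(toy_docs).vocabulary_)

print(full_vocab)
print(capped_vocab)
```

Note that the cut is made on raw frequency, not on how useful a term is for the classifier, so a small cap can discard rare but informative words.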

In [19]:
# Step 15a: Declare TfidfVectorizer object, but add the max_features parameter - 100 features
vectorizer = TfidfVectorizer(max_features= 100)

# Step 15b: Perform a fit_transform method with the 'cleaned_definition_nostop' column values
vectorised_def = vectorizer.fit_transform(df['cleaned_definition_nostop'])
vectorised_def

# Step 15c: Convert the sparse matrix into a new DataFrame
df_sparse = pd.DataFrame.sparse.from_spmatrix(vectorised_def, columns = vectorizer.get_feature_names_out())
df_sparse



Unnamed: 0,act,always,another,anyone,anything,around,ass,back,band,best,...,vagina,want,way,well,will,without,woman,women,word,world
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2281,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2282,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2283,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2284,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Step 16: Repeat earlier steps
Now that we have a new DataFrame, we'll have to:
1. Prepare the independent variables (dependent variable remains the same)
2. Perform the splitting of the data into train and test
3. Train a Random Forest model
4. Perform prediction and assess predictions using f1_score and confusion matrix


In [20]:
# Step 16a: Prepare your independent and dependent variables
independent_variables = df_sparse
dependent_variable = df['is_misogyny'].to_list()

# Step 16b: Split your data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(independent_variables, dependent_variable, stratify = dependent_variable)

# Step 16c: Train a RandomForestClassifier model (or other models)
rforest = RandomForestClassifier()
rforest.fit(X_train, y_train)


# Step 16d: Make a new set of predictions
predicted = rforest.predict(X_test)

# Step 16e: Assess your prediction with f1_score and confusion_matrix
F1 = f1_score(y_test, predicted)
cfn_matrix = confusion_matrix(y_test, predicted)

# Step 16f: Get feature importances
importances = rforest.feature_importances_

# Step 16g: Check the most important features
rf_df = pd.DataFrame(data = {'feature': independent_variables.columns, 'importance': importances})
rf_df.sort_values('importance', ascending = False)

Unnamed: 0,feature,importance
90,vagina,0.121305
23,female,0.107708
68,pussy,0.066100
96,woman,0.054679
75,sex,0.038031
...,...,...
16,doesnt,0.001863
25,first,0.001721
11,black,0.001335
8,band,0.001099


In [21]:
model_type = "Random Forest Classifier"
print(model_type)
print('=' * len(model_type))
print("F1 Score: {}".format(F1))
print('')
print('Confusion Matrix')
print(cfn_matrix)

Random Forest Classifier
F1 Score: 0.7660455486542443

Confusion Matrix
[[274  39]
 [ 74 185]]


### Step 17: Tweak max_feature values
The model with 100 features performs worse (F1 score of 0.766) than the model with all 16,269 features (F1 score of 0.861).
Let's raise max_features from 100 in increments of 250 and see how the score changes.
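
The cells in this step repeat the same code with a different `max_features` each time. One way to avoid the repetition is to wrap the steps in a function and loop over the candidate values. The sketch below does that on a hypothetical toy corpus so that it runs on its own; the function name `f1_for_max_features`, the toy data and the fixed `seed` are all assumptions, not part of the notebook. It passes the sparse matrix straight to the model rather than converting it to a DataFrame first.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def f1_for_max_features(texts, labels, max_features, seed=0):
    """Vectorize, split, train a random forest and return its test F1 score."""
    vectorizer = TfidfVectorizer(max_features=max_features)
    features = vectorizer.fit_transform(texts)
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, stratify=labels, random_state=seed
    )
    model = RandomForestClassifier(random_state=seed)
    model.fit(X_train, y_train)
    return f1_score(y_test, model.predict(X_test))

# Hypothetical toy corpus: class 1 texts mention "apple", class 0 texts mention "car"
fillers = ["red", "green", "fast", "slow", "big", "small", "old", "new", "shiny", "cheap"]
toy_texts = [f"apple {w}" for w in fillers] + [f"car {w}" for w in fillers]
toy_labels = [1] * len(fillers) + [0] * len(fillers)

# Small caps suit the toy corpus; the notebook's values would be range(100, 1351, 250)
toy_sizes = [2, 6, 12]
scores = {size: f1_for_max_features(toy_texts, toy_labels, size) for size in toy_sizes}

for size, score in scores.items():
    print(f"max_features={size}: F1={score:.3f}")
```

Fixing the seed means every value of `max_features` is scored on the same train/test split, which makes the scores easier to compare.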

In [22]:
# Step 17: Continuously tweak the max_features parameter until the performance cannot be further improved
# 350 max features
vectorizer = TfidfVectorizer(max_features= 350)
vectorised_def = vectorizer.fit_transform(df['cleaned_definition_nostop'])
df_sparse = pd.DataFrame.sparse.from_spmatrix(vectorised_def, columns = vectorizer.get_feature_names_out())
df_sparse



Unnamed: 0,act,actually,age,album,almost,although,always,amazing,american,amount,...,word,words,work,world,year,years,yet,youll,young,youre
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2281,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2282,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2283,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2284,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
# Step 17a: Prepare your independent and dependent variables
independent_variables = df_sparse
dependent_variable = df['is_misogyny'].to_list()

# Step 17b: Split your data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(independent_variables, dependent_variable, stratify = dependent_variable)

# Step 17c: Train a RandomForestClassifier model (or other models)
rforest = RandomForestClassifier()
rforest.fit(X_train, y_train)


# Step 17d: Make a new set of predictions
predicted = rforest.predict(X_test)

# Step 17e: Assess your prediction with f1_score and confusion_matrix
F1 = f1_score(y_test, predicted)
cfn_matrix = confusion_matrix(y_test, predicted)

# Step 17f: Get feature importances
importances = rforest.feature_importances_

# Step 17g: Check the most important features
rf_df = pd.DataFrame(data = {'feature': independent_variables.columns, 'importance': importances})
rf_df.sort_values('importance', ascending = False)

Unnamed: 0,feature,importance
324,vagina,0.101364
92,female,0.084536
337,woman,0.060747
239,pussy,0.060449
67,dick,0.030961
...,...,...
191,members,0.000088
75,due,0.000076
49,child,0.000061
245,reason,0.000055


In [24]:
model_type = "Random Forest Classifier"
print(model_type)
print('=' * len(model_type))
print("F1 Score: {}".format(F1))
print('')
print('Confusion Matrix')
print(cfn_matrix)

Random Forest Classifier
F1 Score: 0.8492063492063493

Confusion Matrix
[[282  31]
 [ 45 214]]


In [25]:
vectorizer = TfidfVectorizer(max_features= 600)
vectorised_def = vectorizer.fit_transform(df['cleaned_definition_nostop'])
df_sparse = pd.DataFrame.sparse.from_spmatrix(vectorised_def, columns = vectorizer.get_feature_names_out())
# Step 17a: Prepare your independent and dependent variables
independent_variables = df_sparse
dependent_variable = df['is_misogyny'].to_list()

# Step 17b: Split your data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(independent_variables, dependent_variable, stratify = dependent_variable)

# Step 17c: Train a RandomForestClassifier model (or other models)
rforest = RandomForestClassifier()
rforest.fit(X_train, y_train)


# Step 17d: Make a new set of predictions
predicted = rforest.predict(X_test)

# Step 17e: Assess your prediction with f1_score and confusion_matrix
F1 = f1_score(y_test, predicted)
cfn_matrix = confusion_matrix(y_test, predicted)

# Step 17f: Get feature importances
importances = rforest.feature_importances_

# Step 17g: Check the most important features
rf_df = pd.DataFrame(data = {'feature': independent_variables.columns, 'importance': importances})
display(rf_df.sort_values('importance', ascending = False))

model_type = "Random Forest Classifier"
print(model_type)
print('=' * len(model_type))
print("F1 Score: {}".format(F1))
print('')
print('Confusion Matrix')
print(cfn_matrix)



Unnamed: 0,feature,importance
556,vagina,0.096974
179,female,0.084583
407,pussy,0.054810
583,woman,0.048696
127,dick,0.024311
...,...,...
273,late,0.000012
150,emo,0.000010
262,jesus,0.000000
292,listen,0.000000


Random Forest Classifier
F1 Score: 0.8714859437751004

Confusion Matrix
[[291  22]
 [ 42 217]]


In [26]:
vectorizer = TfidfVectorizer(max_features= 850)
vectorised_def = vectorizer.fit_transform(df['cleaned_definition_nostop'])
df_sparse = pd.DataFrame.sparse.from_spmatrix(vectorised_def, columns = vectorizer.get_feature_names_out())
# Step 17a: Prepare your independent and dependent variables
independent_variables = df_sparse
dependent_variable = df['is_misogyny'].to_list()

# Step 17b: Split your data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(independent_variables, dependent_variable, stratify = dependent_variable)

# Step 17c: Train a RandomForestClassifier model (or other models)
rforest = RandomForestClassifier()
rforest.fit(X_train, y_train)


# Step 17d: Make a new set of predictions
predicted = rforest.predict(X_test)

# Step 17e: Assess your prediction with f1_score and confusion_matrix
F1 = f1_score(y_test, predicted)
cfn_matrix = confusion_matrix(y_test, predicted)

# Step 17f: Get feature importances
importances = rforest.feature_importances_

# Step 17g: Check the most important features
rf_df = pd.DataFrame(data = {'feature': independent_variables.columns, 'importance': importances})
display(rf_df.sort_values('importance', ascending = False))

model_type = "Random Forest Classifier"
print(model_type)
print('=' * len(model_type))
print("F1 Score: {}".format(F1))
print('')
print('Confusion Matrix')
print(cfn_matrix)



Unnamed: 0,feature,importance
791,vagina,0.088158
270,female,0.084326
587,pussy,0.053541
828,woman,0.042801
649,sex,0.022618
...,...,...
382,jesus,0.000000
521,osby,0.000000
574,prison,0.000000
17,albums,0.000000


Random Forest Classifier
F1 Score: 0.899009900990099

Confusion Matrix
[[294  19]
 [ 32 227]]


In [27]:
vectorizer = TfidfVectorizer(max_features= 1100)
vectorised_def = vectorizer.fit_transform(df['cleaned_definition_nostop'])
df_sparse = pd.DataFrame.sparse.from_spmatrix(vectorised_def, columns = vectorizer.get_feature_names_out())
# Step 17a: Prepare your independent and dependent variables
independent_variables = df_sparse
dependent_variable = df['is_misogyny'].to_list()

# Step 17b: Split your data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(independent_variables, dependent_variable, stratify = dependent_variable)

# Step 17c: Train a RandomForestClassifier model (or other models)
rforest = RandomForestClassifier()
rforest.fit(X_train, y_train)


# Step 17d: Make a new set of predictions
predicted = rforest.predict(X_test)

# Step 17e: Assess your prediction with f1_score and confusion_matrix
F1 = f1_score(y_test, predicted)
cfn_matrix = confusion_matrix(y_test, predicted)

# Step 17f: Get feature importances
importances = rforest.feature_importances_

# Step 17g: Check the most important features
rf_df = pd.DataFrame(data = {'feature': independent_variables.columns, 'importance': importances})
display(rf_df.sort_values('importance', ascending = False))

model_type = "Random Forest Classifier"
print(model_type)
print('=' * len(model_type))
print("F1 Score: {}".format(F1))
print('')
print('Confusion Matrix')
print(cfn_matrix)



Unnamed: 0,feature,importance
1027,vagina,0.085468
344,female,0.079218
758,pussy,0.056302
1075,woman,0.044262
253,dick,0.020109
...,...,...
640,noob,0.000000
81,bands,0.000000
617,mtv,0.000000
1,14,0.000000


Random Forest Classifier
F1 Score: 0.8663967611336033

Confusion Matrix
[[292  21]
 [ 45 214]]


In [28]:
vectorizer = TfidfVectorizer(max_features= 1350)
vectorised_def = vectorizer.fit_transform(df['cleaned_definition_nostop'])
df_sparse = pd.DataFrame.sparse.from_spmatrix(vectorised_def, columns = vectorizer.get_feature_names_out())
# Step 17a: Prepare your independent and dependent variables
independent_variables = df_sparse
dependent_variable = df['is_misogyny'].to_list()

# Step 17b: Split your data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(independent_variables, dependent_variable, stratify = dependent_variable)

# Step 17c: Train a RandomForestClassifier model (or other models)
rforest = RandomForestClassifier()
rforest.fit(X_train, y_train)


# Step 17d: Make a new set of predictions
predicted = rforest.predict(X_test)

# Step 17e: Assess your prediction with f1_score and confusion_matrix
F1 = f1_score(y_test, predicted)
cfn_matrix = confusion_matrix(y_test, predicted)

# Step 17f: Get feature importances
importances = rforest.feature_importances_

# Step 17g: Check the most important features
rf_df = pd.DataFrame(data = {'feature': independent_variables.columns, 'importance': importances})
display(rf_df.sort_values('importance', ascending = False))

model_type = "Random Forest Classifier"
print(model_type)
print('=' * len(model_type))
print("F1 Score: {}".format(F1))
print('')
print('Confusion Matrix')
print(cfn_matrix)



Unnamed: 0,feature,importance
1260,vagina,0.084515
425,female,0.071671
926,pussy,0.050577
1320,woman,0.041813
308,dick,0.020693
...,...,...
1273,waist,0.000000
109,battle,0.000000
905,prison,0.000000
823,osby,0.000000


Random Forest Classifier
F1 Score: 0.8706365503080082

Confusion Matrix
[[297  16]
 [ 47 212]]


Comparing the runs, max_features = 850 gives the best F1 score for the Random Forest classifier, at 0.899.

Beyond 850, the F1 score drops to 0.866 at 1,100 max_features and 0.871 at 1,350.
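
For reference, the scores from the runs above can be tabulated in a few lines. The values are copied from the printed outputs and rounded to four decimals. Since no `random_state` was set, each run used a different random split and a differently seeded forest, so small gaps between neighbouring values may partly reflect sampling noise.

```python
# F1 scores copied from the Random Forest outputs above, keyed by max_features
f1_by_max_features = {
    100: 0.7660,
    350: 0.8492,
    600: 0.8715,
    850: 0.8990,
    1100: 0.8664,
    1350: 0.8706,
}

# Pick the setting with the highest score
best_size = max(f1_by_max_features, key=f1_by_max_features.get)

for size, score in f1_by_max_features.items():
    marker = " <- best" if size == best_size else ""
    print(f"{size:>5}: {score:.4f}{marker}")
```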

## Model testing
### Step 18: Test a few strings and see if they are misogynist or not
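Come up with a few strings and see whether the model classifies them as misogynist. Note that the vectorizer and rforest variables currently hold the objects from the last cell we ran (max_features = 1350).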

In [29]:
# Step 18a: Declare a few strings containing mock definitions 
string_1 = 'princess'
string_2 = 'dude'
string_3 = 'bitch'

# Step 18b: Append the strings into a list
test_list = [string_1, string_2, string_3]

# Step 18c: .transform these strings using the TfidfVectorizer object
vect_test = vectorizer.transform(test_list)

# Step 18d: Use the trained RandomForest model to predict what class your strings are
rforest.predict(vect_test)



array([0., 0., 1.])
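
When reading predictions like these, it helps to remember that `.transform` only knows the vocabulary learned during `.fit_transform`. A word the vectorizer has never seen maps to an all-zero row, so the model has nothing to go on for it. The sketch below shows this on a hypothetical toy corpus; the variable names are made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit on a hypothetical toy corpus
toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(["red apple", "red car", "red fast car"])

# "apple" is in the learned vocabulary, "bicycle" is not
new_rows = toy_vectorizer.transform(["apple", "bicycle"])

# Number of non-zero entries in each row
nonzeros_per_row = new_rows.getnnz(axis=1).tolist()

print(new_rows.shape)
print(nonzeros_per_row)
```

A prediction of 0 for a short test string may therefore mean either that the model judged the word harmless or simply that the word was not among the retained features.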