# Objective for Part 3

As a recap, in Part 1 we cleaned the data by removing missing values, converting text to lowercase, and stripping punctuation. In Part 2, we used WordCloud to visualize which keywords distinguish Urban Dictionary definitions tagged as misogynistic from those tagged as non-misogynistic.

In Part 3, we will:
- Perform another round of data cleaning: remove stopwords, so that common function words do not end up as TF-IDF features in the next step.
- Perform TF-IDF vectorization on the data without stopwords.
- Split the data into training and test sets.
- Train four classification models on the training set: DummyClassifier, LogisticRegression, DecisionTreeClassifier, and RandomForestClassifier.
- For each model, calculate the F1 score and confusion matrix from its predictions and the actual target values of the test set.
- From the best model, extract the feature importances to see which features contribute most to the classification.
- Based on the number of important features, adjust the number of features used to train the best model to obtain a more efficient model.
- Test the model on unseen data, using a few hand-written strings with misogynistic and non-misogynistic words, to see whether the predictions are accurate.

In [1]:
# Step 1: Import your libraries 
import pandas as pd
from wordcloud import STOPWORDS

print(STOPWORDS)

{'further', 'i', "isn't", "you're", 'at', 'theirs', 'again', "how's", 'nor', 'most', 'why', 'for', 'few', 'had', 'otherwise', 'their', "they're", 'through', 'while', 'does', 'which', 'because', "who's", "we've", "shan't", 'been', 'between', 'myself', "they'll", "where's", "i'd", 'she', 'its', 'however', 'down', "what's", "when's", 'you', 'be', "doesn't", "here's", "he'll", 'did', 'of', 'it', "wasn't", 'else', 'yourselves', 'is', "haven't", 'do', 'out', "they've", 'like', 'who', 'there', "aren't", 'these', 'your', "won't", 'how', "shouldn't", 'more', 'my', 'also', 'yourself', "couldn't", 'her', 'me', "didn't", 'they', 'only', 'k', 'therefore', 'after', "let's", 'same', 'themselves', 'here', 'get', 'during', "he'd", 'yours', "we'll", "hadn't", 'ourselves', 'having', 'him', 'himself', 'itself', 'or', "don't", 'own', 'if', 'but', "it's", "hasn't", 'too', 'with', 'our', "she'd", 'so', 'up', 'was', 'the', 'above', 'both', 'into', 'shall', "i'll", "can't", 'over', 'herself', 'ought', 'cannot'

In [2]:
# Step 2: Read your CSV
df = pd.read_csv("ManualTag_Misogyny_Clean.csv", index_col=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2286 entries, Ur gonna die... queer to That hoe out there!!!
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   is_misogyny         2286 non-null   float64
 1   cleaned_definition  2286 non-null   object 
dtypes: float64(1), object(1)
memory usage: 53.6+ KB


In [3]:
# Step 3: Remove stopwords from the text in 'cleaned_definition'
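# STOPWORDS is already a set; keeping it as a set makes each membership check fast
stop = set(STOPWORDS)

# Keep every whitespace-separated word in s that is not a stopword
# and return the remaining words joined into a single string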
def remove_stopwords(s):
    list_nostop = [item for item in s.split() if item not in stop]
    return " ".join(list_nostop)

df['cleaned_definition_nostop'] = df['cleaned_definition'].map(remove_stopwords)
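
As a quick sanity check on Step 3, the sketch below applies the same filtering logic to one made-up sentence. The three-word stopword set, the helper name `remove_stopwords_demo`, and the sample sentence are hypothetical stand-ins, not the wordcloud list or a row from the dataset.

```python
# Hypothetical mini stopword set (the notebook itself uses wordcloud's STOPWORDS).
demo_stop = {"the", "is", "a"}

def remove_stopwords_demo(s, stopwords):
    # Keep only the whitespace-separated tokens that are not stopwords.
    return " ".join(tok for tok in s.split() if tok not in stopwords)

sample = "the cat is on a mat"
cleaned = remove_stopwords_demo(sample, demo_stop)
print(cleaned)  # cat on mat
```

The match is exact and case-sensitive, which is why the lowercasing done in Part 1 matters here. If Part 1 also stripped apostrophes, contractions such as "isnt" would no longer match entries like "isn't" and would stay in the text.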

In [4]:
# Step 4: Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
print(df['cleaned_definition_nostop'].tolist())



In [6]:
corpus = df['cleaned_definition_nostop'].tolist()

# Step 5a: Declare your TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Step 5b: Perform a fit_transform method with the 'cleaned_definition_nostop' column values
X = vectorizer.fit_transform(corpus)


In [7]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out())



In [8]:
print(X.shape)

(2286, 16289)
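
To make the shape above less abstract (one row per definition, one column per distinct term), here is a minimal pure-Python sketch of the weighting that TfidfVectorizer applies with its default settings: raw term counts multiplied by a smoothed inverse document frequency, then each row scaled to unit length. The three-document corpus and the helper name `tfidf_row` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already lowercased and stopword-free.
docs = ["red apple", "red car", "fast car race"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in tokenized for term in set(doc))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def tfidf_row(tokens):
    # Term count times idf, then L2-normalize the row.
    counts = Counter(tokens)
    raw = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

rows = [tfidf_row(doc) for doc in tokenized]
vocab = sorted(doc_freq)
print(len(rows), len(vocab))  # 3 5
```

In the first toy document, "apple" appears in only one document while "red" appears in two, so "apple" receives the larger weight. Rarer terms are treated as more informative.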


In [9]:
# Step 6: Convert the sparse matrix into a DataFrame
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df_X = pd.DataFrame.sparse.from_spmatrix(X, columns=vectorizer.get_feature_names_out())

In [10]:
# Step 7: Import your ML libraries
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, confusion_matrix

In [11]:
# Step 8: Prepare your target (dependent) variable; the features are already in df_X
y = df['is_misogyny']

In [12]:
# Step 9: Split your data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df_X, y, test_size=0.2, stratify=y)

In [13]:
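# Step 10a: Declare a variable to store the DummyClassifier model
# strategy="stratified" guesses labels at random in proportion to the class frequencies;
# it was the default before scikit-learn 0.24 and is consistent with the output shown below
dummy_clf = DummyClassifier(strategy="stratified")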

# Step 10b: Fit your train dataset
dummy_clf.fit(X_train, y_train)

# Step 10c: Declare a variable and store your predictions that you make with your model using X test data
prediction = dummy_clf.predict(X_test)

# Step 10d: Print the f1_score between the y test and prediction
print(f1_score(y_test, prediction))

# Step 10e: Print the confusion matrix using the y test and prediction
print(confusion_matrix(y_test, prediction))

0.42685851318944845
[[130 121]
 [118  89]]




In [14]:
# Step 11a: Declare a variable to store the LogisticRegression model
lr = LogisticRegression()

# Step 11b: Fit your train dataset
lr.fit(X_train, y_train)

# Step 11c: Declare a variable and store your predictions that you make with your model using X test data
pred_lr = lr.predict(X_test)

# Step 11d: Print the f1_score between the y test and prediction
print(f1_score(y_test, pred_lr))

# Step 11e: Print the confusion matrix using the y test and prediction
print(confusion_matrix(y_test, pred_lr))


0.8133704735376045
[[245   6]
 [ 61 146]]
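
The F1 score can be recomputed by hand from the confusion matrix, which helps when reading the remaining results. The sketch below uses the LogisticRegression numbers printed above and assumes scikit-learn's usual layout of `[[TN, FP], [FN, TP]]` for labels 0 and 1.

```python
# Confusion matrix printed for LogisticRegression above,
# read as [[TN, FP], [FN, TP]].
tn, fp = 245, 6
fn, tp = 61, 146

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.9605 0.7053 0.8134
```

Read this way, the logistic regression model rarely raises a false alarm (6 cases) but misses 61 of the 207 misogynistic definitions in the test set, which is what pulls its F1 score down.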


In [15]:
# Step 12: Train a DecisionTreeClassifier model
# Step 12a: Declare a variable to store the DecisionTreeClassifier model
tree = DecisionTreeClassifier()

# Step 12b: Fit your train dataset
tree.fit(X_train, y_train)

# Step 12c: Declare a variable and store your predictions that you make with your model using X test data
pred_tree = tree.predict(X_test)

# Step 12d: Print the f1_score between the y test and prediction
print(f1_score(y_test, pred_tree))

# Step 12e: Print the confusion matrix using the y test and prediction
print(confusion_matrix(y_test, pred_tree))


0.8640776699029126
[[224  27]
 [ 29 178]]


In [16]:
# Step 13: Train a RandomForestClassifier model
# Step 13a: Declare a variable to store the RandomForestClassifier model
rf = RandomForestClassifier()

# Step 13b: Fit your train dataset
rf.fit(X_train, y_train)

# Step 13c: Declare a variable and store your predictions that you make with your model using X test data
pred_rf = rf.predict(X_test)

# Step 13d: Print the f1_score between the y test and prediction
print(f1_score(y_test, pred_rf))

# Step 13e: Print the confusion matrix using the y test and prediction
print(confusion_matrix(y_test, pred_rf))


0.8984771573604061
[[241  10]
 [ 30 177]]


In [17]:
# Step 14: Create a DataFrame containing the feature importances of your model and sort! 
df_rf = pd.DataFrame(rf.feature_importances_, index=df_X.columns, columns=['feature importance'])

In [18]:
df_rf[df_rf['feature importance'] > 0].sort_values(by='feature importance',ascending=False)

              feature importance
vagina              6.695843e-02
female              5.572821e-02
pussy               3.473789e-02
woman               2.875108e-02
penis               1.479593e-02
...                          ...
hernnnote           5.151689e-09
philadelphia        4.619423e-09
techno              4.582008e-09
juggalette          3.989479e-09
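
One way to turn this table into a feature budget for the next step is to count how many of the largest importances are needed to cover a chosen share of the total. The sketch below is a minimal illustration: the helper name `n_features_for_share` and the `toy` vector are hypothetical, and only the five values in `top5` are copied from the output above.

```python
from itertools import accumulate

def n_features_for_share(importances, share):
    """Return how many of the largest importances are needed to reach `share` of the total."""
    ordered = sorted(importances, reverse=True)
    total = sum(ordered)
    for n, running in enumerate(accumulate(ordered), start=1):
        if running >= share * total:
            return n
    return len(ordered)

# Top five importances copied from the table above.
top5 = [6.695843e-02, 5.572821e-02, 3.473789e-02, 2.875108e-02, 1.479593e-02]
print(round(sum(top5), 4))  # 0.201

# Hypothetical importance vector, for illustration only.
toy = [0.5, 0.3, 0.1, 0.05, 0.05]
print(n_features_for_share(toy, 0.85))  # 3
```

Because random forest importances in scikit-learn sum to 1, the first figure means the five listed terms carry roughly 20% of the total importance. In the notebook, the same helper could be called on `rf.feature_importances_` to suggest a value for `max_features`.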


In [19]:
# Step 15a: Declare your TfidfVectorizer object, but add the max_features parameter - 100 features
vec = TfidfVectorizer(max_features=100)

# Step 15b: Perform a fit_transform method with the 'cleaned_definition_nostop' column values
X2 = vec.fit_transform(corpus)

# Step 15c: Convert the sparse matrix into a new DataFrame
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
df_X2 = pd.DataFrame.sparse.from_spmatrix(X2, columns=vec.get_feature_names_out())

In [20]:
# Step 16a: Prepare your variables (features are in df_X2; the target y from Step 8 is reused)

# Step 16b: Split your data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df_X2, y, test_size=0.2, stratify=y)

# Step 16c: Train a RandomForestClassifier model (or other models)
rf2 = RandomForestClassifier()
rf2.fit(X_train, y_train)

# Step 16d: Make a new set of predictions
pred_rf2 = rf2.predict(X_test)

# Step 16e: Assess your prediction with f1_score and confusion_matrix
print(f1_score(y_test, pred_rf2))
print(confusion_matrix(y_test, pred_rf2))

0.8333333333333334
[[227  24]
 [ 42 165]]


In [30]:
# Step 17a: Tweak your TfidfVectorizer object
vec = TfidfVectorizer(max_features=850)
X2 = vec.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df_X2 = pd.DataFrame.sparse.from_spmatrix(X2, columns=vec.get_feature_names_out())

# Step 17b: Retrain your RandomForestClassifier model
X_train, X_test, y_train, y_test = train_test_split(df_X2, y, test_size=0.2, stratify=y)

rf2 = RandomForestClassifier()
rf2.fit(X_train, y_train)
pred_rf2 = rf2.predict(X_test)
print(f1_score(y_test, pred_rf2))
print(confusion_matrix(y_test, pred_rf2))

0.8759493670886077
[[236  15]
 [ 34 173]]
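
One caveat when comparing runs with different `max_features` values: each call to `train_test_split` draws a fresh random split, and the vectorizer is fitted on the full corpus before splitting, so part of the difference between scores may come from the split rather than from the feature count. A minimal sketch of a stricter comparison follows, assuming a fixed `random_state` and a scikit-learn `Pipeline` that fits the vectorizer on training text only. The toy corpus, the labels, and the candidate values 2 and 5 are hypothetical stand-ins for `corpus`, `y`, and the values tried above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus and labels standing in for `corpus` and `y`.
toy_corpus = [
    "red apple pie", "green apple tart", "sweet apple juice", "ripe apple core",
    "fast red car", "slow blue car", "old rusty car", "new shiny car",
]
toy_y = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw text once, with a fixed seed, so every candidate sees the same rows.
train_text, test_text, train_y, test_y = train_test_split(
    toy_corpus, toy_y, test_size=0.25, stratify=toy_y, random_state=0
)

scores = {}
for k in (2, 5):  # hypothetical max_features candidates
    model = Pipeline([
        ("tfidf", TfidfVectorizer(max_features=k)),      # fitted on training text only
        ("rf", RandomForestClassifier(random_state=0)),  # fixed seed for repeatability
    ])
    model.fit(train_text, train_y)
    scores[k] = f1_score(test_y, model.predict(test_text))

print(scores)
```

A fitted pipeline of this kind also accepts raw strings directly in `predict`, so new text would not need a separate `transform` call.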


In [31]:
# Step 18a: Declare a few strings containing mock definitions 
str1 = "problems bitch need machine gun"
str2 = "around world speak language booty explaining understand talk dirty"
str3 = "Mary had little lamb"

# Step 18b: Append the strings into a list
new_corpus = [str1,str2,str3]

# Step 18c: Transform these strings with the already-fitted TfidfVectorizer object (transform only, no refit)
new_X_test = vec.transform(new_corpus)

# Step 18d: Use the trained RandomForest model to predict what class your strings are
new_pred = rf2.predict(new_X_test)

print("new corpus prediction:", new_pred)

new corpus prediction: [1. 0. 0.]
