# Project Scenario

There have been many complaints about hate speech, misogyny, and racism in the definitions stored in the Urban Dictionary database.

You decide to manually annotate the definitions and judge whether each one is misogynistic or not.

Misogyny means "dislike of, contempt for, or ingrained prejudice against women", i.e. any definition that disparages women.

In this project, we will use the annotated dataset to build a machine learning model to classify text for misogyny.

### What we'll be doing:
In this project, we will do the following:

1. Download the Urban Dictionary text corpus and clean up the data (Part I)
2. Perform word cloud analysis (Part II)
3. Transform the data for feature extraction and apply machine learning techniques to train a model that predicts whether an Urban Dictionary definition contains misogynistic text (Part III)

### Step 1: Download the dataset and unzip
The dataset that we will work with comes from the hard work of Theodore Lynn and her team, who manually annotated Urban Dictionary definitions.

The data can be found <a href = 'https://md-datasets-cache-zipfiles-prod.s3.eu-west-1.amazonaws.com/3jfwsdkryy-3.zip'>here</a>
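
If you would rather script this step than download the archive by hand, a minimal sketch using only the standard library could look like the one below. The helper name `download_and_extract`, the `data` folder, and the `dataset.zip` file name are hypothetical choices. Nothing is assumed about the archive's internal layout, so the function simply returns the member names it finds.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path

# URL taken from the link above
DATA_URL = "https://md-datasets-cache-zipfiles-prod.s3.eu-west-1.amazonaws.com/3jfwsdkryy-3.zip"


def download_and_extract(url, dest_dir="data"):
    """Download a zip archive from `url`, extract it into `dest_dir`,
    and return the list of member names in the archive."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    zip_path = dest / "dataset.zip"  # hypothetical local file name

    # Stream the response to disk instead of holding it all in memory
    with urllib.request.urlopen(url) as response, open(zip_path, "wb") as out:
        shutil.copyfileobj(response, out)

    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Example call (needs network access), left commented out on purpose:
# print(download_and_extract(DATA_URL))
```

Check the returned names to locate the CSV used in Step 3.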



### Step 2: Import pandas and relevant libraries

In [5]:
# Step 2: Import libraries
import pandas as pd
import re
import string

### Step 3: Read the CSV as a DataFrame

In [6]:
# Step 3: Read the CSV and specify the right encoding
df = pd.read_csv('ManualTag_Misogyny.csv', encoding = 'latin-1')
df

Unnamed: 0,Definition,is_misogyny
0,Ur gonna die... queer,0.0
1,Valuptuous man boobs.,0.0
2,Variation of brother.,0.0
3,Very impressive penis,0.0
4,What I call my penis.,0.0
...,...,...
2281,"A women who is ""easy""",1.0
2282,Any hot/ sexy chicks.,1.0
2283,Any vaginal secretion,1.0
2284,Person who slaps hoes,1.0
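
To see why the `encoding` argument matters, here is a small sketch on a made-up two-line CSV (the file name `toy_latin1.csv` and its contents are invented for illustration). Bytes written as Latin-1 can contain values that are not valid UTF-8, so the encoding has to be given explicitly when reading.

```python
import tempfile
from pathlib import Path

import pandas as pd

# A toy CSV whose text contains a non-ASCII character, stored as Latin-1 bytes
raw = "Definition,is_misogyny\ncafé talk,0\n".encode("latin-1")

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_latin1.csv"  # hypothetical file name
    path.write_bytes(raw)
    # Reading with the matching encoding recovers the original text
    toy = pd.read_csv(path, encoding="latin-1")

print(toy.loc[0, "Definition"])

# The same bytes are not valid UTF-8, which is pandas' default encoding
try:
    raw.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
print(decodes_as_utf8)
```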


### Step 4: Find the total number of rows with empty values
Let's count the number of missing values in each column.

In [8]:
# Step 4: Count the sum of null values in your columns
# isnull().sum() counts the missing values; adding .any() would only report whether at least one exists
print("Null values in Definition column:", df['Definition'].isnull().sum())
print("Null values in is_misogyny column:", df['is_misogyny'].isnull().sum())

Null values in Definition column: 0
Null values in is_misogyny column: 1
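
As a side note, the counts for every column can be obtained in one call. The sketch below uses a small invented frame (`toy`), not the real dataset, to show the difference between counting missing values and merely testing whether any exist.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing labels (invented data, for illustration only)
toy = pd.DataFrame({
    "Definition": ["first", "second", "third"],
    "is_misogyny": [0.0, np.nan, np.nan],
})

# One count per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# .any() only answers "is there at least one?", so it cannot tell 1 from 2
has_missing = toy["is_misogyny"].isnull().any()
print(has_missing)
```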


### Step 5: Identify where the missing row is
It turns out that only one row has a missing value.

In [11]:
# Step 5: Find the index of the row containing missing value
df[df['is_misogyny'].isnull()].index[0]

1251
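
The lookup works through a boolean mask: `isnull()` returns one `True`/`False` per row, and that mask can then select rows, index labels, or a single column. A minimal sketch on an invented three-row frame:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing label (invented data)
toy = pd.DataFrame({
    "Definition": ["first definition", "second definition", "third definition"],
    "is_misogyny": [0.0, np.nan, 1.0],
})

# Boolean mask: True where the label is missing
mask = toy["is_misogyny"].isnull()

# Index labels of the rows with a missing label
missing_index = toy[mask].index.tolist()
print(missing_index)

# Text of the first row with a missing label, selected by column name
missing_text = toy.loc[mask, "Definition"].iloc[0]
print(missing_text)
```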

### Step 6: Read the definition of the row containing the missing 'is_misogyny' value

In [13]:
# Step 6: Read the definition of the row with missing value
# Select the 'Definition' column by name rather than by position
df.loc[df['is_misogyny'].isnull(), 'Definition'].iloc[0]

"When someone makes a post on Facebook and you and another friend (the double teamers) both comment on the post in order to make fun of the original post. The comments almost always end in smiley faces to express sarcastic joy or happiness. The comments need to be made close together time-wise and should convey a similar message.\\n\\nThis is often used when a couple are getting too lovey dovey on Facebook and you want to ruin the moment and make it awkward for both of them. It is best to use it when the couple just like each other and aren't going out yet. Use it then for optimum awkwardness. Facebook double teaming is also used to make fun of a dude who is whipped by his girl. [Whit-cheee]\\n\\nIt can also be used to argue and make fun of people who make stupids posts or statuses."

### Step 7: Replace NaN with the appropriate label
The definition looks harmless, so replace the missing label with 0.


In [14]:
# Step 7: Fill the NaN with 0
# Fill only the label column so that no other column is touched
df['is_misogyny'] = df['is_misogyny'].fillna(0)
df

Unnamed: 0,Definition,is_misogyny
0,Ur gonna die... queer,0.0
1,Valuptuous man boobs.,0.0
2,Variation of brother.,0.0
3,Very impressive penis,0.0
4,What I call my penis.,0.0
...,...,...
2281,"A women who is ""easy""",1.0
2282,Any hot/ sexy chicks.,1.0
2283,Any vaginal secretion,1.0
2284,Person who slaps hoes,1.0


### Step 8: Check if your replacement is successful
Make sure there are no more NaNs.

In [15]:
# Step 8: Check the sum of nulls again
print("Null values in Definition column:", df['Definition'].isnull().sum())
print("Null values in is_misogyny column:", df['is_misogyny'].isnull().sum())

Null values in Definition column: 0
Null values in is_misogyny column: 0
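
If you want the notebook to stop as soon as a missing value slips through, the same check can be written as an assertion. This is only a sketch on an invented frame. The variable name `remaining` is a hypothetical choice.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing label (invented data)
toy = pd.DataFrame({
    "Definition": ["first", "second", "third"],
    "is_misogyny": [0.0, np.nan, 1.0],
})

# Fill the missing label, then count what is left across the whole frame
toy["is_misogyny"] = toy["is_misogyny"].fillna(0)
remaining = int(toy.isnull().sum().sum())

# Fails with a clear message if any NaN survived
assert remaining == 0, f"{remaining} missing values remain"
print(toy["is_misogyny"].tolist())
```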


### Step 9: Count the values for 'is_misogyny' column
Let's look at how the definitions are split between non-misogynistic and misogynistic.

In [16]:
# Step 9: Count the values for 'is_misogyny' column
df['is_misogyny'].value_counts()

0.0    1252
1.0    1034
Name: is_misogyny, dtype: int64
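
To express the split as proportions rather than raw counts, divide by the total. The sketch below re-enters the two counts printed above by hand, so it runs without the dataset. On the real column, `df['is_misogyny'].value_counts(normalize=True)` is the equivalent pandas call.

```python
import pandas as pd

# Counts copied from the output above
counts = pd.Series({0.0: 1252, 1.0: 1034}, name="is_misogyny")

# Share of each label, rounded to three decimals
proportions = (counts / counts.sum()).round(3)
print(proportions)

# The same idea on a toy label column, using normalize=True
toy_labels = pd.Series([0.0, 0.0, 0.0, 1.0])
toy_proportions = toy_labels.value_counts(normalize=True)
print(toy_proportions)
```

With roughly 55% of definitions labelled 0 and 45% labelled 1, the two classes are fairly balanced.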

## Data cleaning
We'll be doing two things:
1. removing punctuation from our definitions
2. lowercasing the text

### Step 10: Remove punctuation from the text
We'll start by removing punctuation from all of the values in the 'Definition' column.

Create a new column named 'cleaned_definition' that contains the text without punctuation.

In [18]:
# Step 10: Create a new column that contains no punctuation
# string.punctuation covers all ASCII punctuation, including `|`
transtab = str.maketrans('', '', string.punctuation)
# Translate each definition on its own so that rows can never be merged or split
df['cleaned_definition'] = df['Definition'].str.translate(transtab)
df

Unnamed: 0,Definition,is_misogyny,cleaned_definition
0,Ur gonna die... queer,0.0,Ur gonna die queer
1,Valuptuous man boobs.,0.0,Valuptuous man boobs
2,Variation of brother.,0.0,Variation of brother
3,Very impressive penis,0.0,Very impressive penis
4,What I call my penis.,0.0,What I call my penis
...,...,...,...
2281,"A women who is ""easy""",1.0,A women who is easy
2282,Any hot/ sexy chicks.,1.0,Any hot sexy chicks
2283,Any vaginal secretion,1.0,Any vaginal secretion
2284,Person who slaps hoes,1.0,Person who slaps hoes
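
The translation-table approach can be tried in isolation. The sketch below applies it to two definitions from the table above plus one invented string containing a pipe character, to show that every ASCII punctuation mark is dropped while spaces are kept.

```python
import string

import pandas as pd

# Two rows from the dataset preview and one invented string with a pipe
toy = pd.Series([
    "Ur gonna die... queer",
    "Any hot/ sexy chicks.",
    "pipe|separated text",
])

# Map every ASCII punctuation character to "delete"
table = str.maketrans("", "", string.punctuation)

# Apply the table to each string independently
cleaned = toy.str.translate(table)
print(cleaned.tolist())
```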


### Step 11: Lower the cases in the new column
Next, we lowercase the values in 'cleaned_definition' so that strings differing only in case are counted as the same string in subsequent analyses.

In [19]:
# Step 11: Lowercase the values in 'cleaned_definition'
df['cleaned_definition'] = df['cleaned_definition'].str.lower()
df

Unnamed: 0,Definition,is_misogyny,cleaned_definition
0,Ur gonna die... queer,0.0,ur gonna die queer
1,Valuptuous man boobs.,0.0,valuptuous man boobs
2,Variation of brother.,0.0,variation of brother
3,Very impressive penis,0.0,very impressive penis
4,What I call my penis.,0.0,what i call my penis
...,...,...,...
2281,"A women who is ""easy""",1.0,a women who is easy
2282,Any hot/ sexy chicks.,1.0,any hot sexy chicks
2283,Any vaginal secretion,1.0,any vaginal secretion
2284,Person who slaps hoes,1.0,person who slaps hoes
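
A short sketch of why this matters for counting, using a handful of invented tokens: before lowercasing, each spelling is counted separately. Afterwards, the spellings collapse into one entry.

```python
import pandas as pd

# Invented tokens that differ only in case, plus one other word
words = pd.Series(["Hot", "hot", "HOT", "easy"])

# Case-sensitive counting treats every spelling as a different word
raw_counts = words.value_counts()
print(len(raw_counts))

# After lowercasing, the three spellings are counted together
lowered_counts = words.str.lower().value_counts()
print(lowered_counts)
```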


### Step 12: Export cleaned DataFrame as CSV
Now that we've done some basic cleaning, time to export our DataFrame to a CSV file.

In [20]:
# Step 12: Export the DataFrame to CSV
df.to_csv('CleanedMisogyny.csv')
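
One thing to be aware of when reading the file back later: by default `to_csv` also writes the row index, which reappears as an extra `Unnamed: 0` column on the next `read_csv`. The sketch below shows both behaviours on an invented one-row frame written to a temporary folder. Whether to pass `index=False` is a choice left to you.

```python
import tempfile
from pathlib import Path

import pandas as pd

# One-row toy frame shaped like the cleaned dataset
toy = pd.DataFrame({
    "Definition": ["Any hot/ sexy chicks."],
    "is_misogyny": [1.0],
    "cleaned_definition": ["any hot sexy chicks"],
})

with tempfile.TemporaryDirectory() as tmp:
    with_index = Path(tmp) / "with_index.csv"
    without_index = Path(tmp) / "without_index.csv"

    toy.to_csv(with_index)                  # default: index is written
    toy.to_csv(without_index, index=False)  # index is left out

    cols_with_index = pd.read_csv(with_index).columns.tolist()
    cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```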