# Objective for Part 1

We will perform the following:
- Download the public data set and clean it:
  - Identify and clean missing values.
  - Convert the text to lowercase and remove punctuation.

Our first step is to understand the source of the data we will be analysing.

Go to the data article "Data set for automatic detection of online misogynistic speech" (https://www.sciencedirect.com/science/article/pii/S2352340919305773).

In [1]:
# Step 2: Import pandas
import pandas as pd

In [2]:
# Step 3: Read the CSV and specify the right encoding
df = pd.read_csv('ManualTag_Misogyny.csv', encoding='latin-1')

In [3]:
# Step 4: Count the null values in each column
print(df.isnull().sum())

Definition     0
is_misogyny    1
dtype: int64


In [4]:
# Step 5: Find the index of the row containing the missing value
index = df[df['is_misogyny'].isnull()].index
print(df.loc[index])

                                             Definition  is_misogyny
1251  When someone makes a post on Facebook and you ...          NaN


In [5]:
# Step 6: Read the definition of the row with the missing value
df['Definition'].loc[index]

1251    When someone makes a post on Facebook and you ...
Name: Definition, dtype: object
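
The output above is cut off with `...`, so the full definition is not actually visible. Here is a minimal sketch of two ways to see the whole text. It uses a toy DataFrame as a stand-in for the real data; the name `toy` and the sentence in it are illustrative assumptions, not content from the data set.

```python
import pandas as pd

# Toy stand-in for the real data: one long text in a row labelled 1251.
toy = pd.DataFrame(
    {
        "Definition": [
            "When a long example definition keeps going well past the default "
            "column width so pandas would shorten it on screen"
        ],
        "is_misogyny": [float("nan")],
    },
    index=[1251],
)

# .at returns the raw Python string for a single cell, so nothing is shortened.
full_text = toy.at[1251, "Definition"]
print(full_text)

# Alternatively, lift the column-width limit temporarily when printing a Series.
with pd.option_context("display.max_colwidth", None):
    print(toy["Definition"])
```

With the real data, `df.at[index[0], 'Definition']` should behave the same way, assuming `index` holds a single label as found in Step 5.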

In [6]:
# Step 7: Fill the NaN with 0
df['is_misogyny'] = df['is_misogyny'].fillna(0)

In [7]:
# Step 8: Check the sum of nulls again
print(df.isnull().sum())

Definition     0
is_misogyny    0
dtype: int64


In [8]:
# Step 9: Count the values in the 'is_misogyny' column
df['is_misogyny'].value_counts()

0.0    1252
1.0    1034
Name: is_misogyny, dtype: int64
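
The counts above (1252 vs 1034 out of 2286 rows) work out to roughly 55% and 45%, so the two classes are fairly balanced. A short sketch of getting such shares directly, shown on a made-up label Series (`labels` is a hypothetical name, not part of the notebook):

```python
import pandas as pd

# Toy labels for illustration only; these are not the real data.
labels = pd.Series([0.0, 0.0, 0.0, 1.0], name="is_misogyny")

# normalize=True returns each value's share of the total instead of raw counts.
shares = labels.value_counts(normalize=True)
print(shares)
```

On the real DataFrame the equivalent call would be `df['is_misogyny'].value_counts(normalize=True)`.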

In [9]:
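# Step 10: Create a new column with punctuation removed
# regex=True is required in pandas 2.0+, where str.replace treats the pattern literally by default
df['cleaned_definition'] = df['Definition'].str.replace(r'[^\w\s]', '', regex=True)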

In [10]:
# Step 11: Lowercase the values in 'cleaned_definition'
df['cleaned_definition'] = df['cleaned_definition'].str.lower()
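
A quick way to sanity-check the two cleaning steps is to run them on a small sample first. The sketch below uses made-up strings (`sample` is a hypothetical name). It passes `regex=True` explicitly because from pandas 2.0 onwards `Series.str.replace` treats the pattern as a literal string unless told otherwise.

```python
import pandas as pd

# Made-up strings for illustration; not taken from the data set.
sample = pd.Series(["Hello, World!", "It's a test... right?"])

# [^\w\s] matches any character that is neither a word character nor whitespace.
no_punct = sample.str.replace(r"[^\w\s]", "", regex=True)

# Lowercase the result, mirroring the lowercasing step.
cleaned = no_punct.str.lower()
print(cleaned.tolist())  # ['hello world', 'its a test right']
```

Note that `\w` includes the underscore, so underscores survive this pattern.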

In [14]:
# Step 12: Export the DataFrame to CSV
df.to_csv('ManualTag_Misogyny_Clean.csv', index=False)

In [15]:
# Inspect the final structure of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2286 entries, 0 to 2285
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Definition          2286 non-null   object 
 1   is_misogyny         2286 non-null   float64
 2   cleaned_definition  2286 non-null   object 
dtypes: float64(1), object(2)
memory usage: 53.7+ KB
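
It can be worth reading an exported file back in to confirm it survived the round trip. The sketch below does this with a toy DataFrame and a temporary file, so the names `toy` and `toy_clean.csv` are assumptions for illustration. It also shows one thing to watch for: a text made only of punctuation becomes an empty string after cleaning, and an empty string is read back from CSV as NaN.

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same three columns; the second text is punctuation only.
toy = pd.DataFrame(
    {
        "Definition": ["Hello, World!", "!!!"],
        "is_misogyny": [0.0, 1.0],
        "cleaned_definition": ["hello world", ""],
    }
)

# Write to a temporary directory and read the file straight back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_clean.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Same columns and row count, but the empty string has come back as NaN.
print(list(reloaded.columns))
print(reloaded.isnull().sum())
```

If the same check on the real export shows nulls in `cleaned_definition`, those rows probably contained no letters or digits to begin with.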
