# Cleaning Data Sets

# Import Libraries & Tools

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, LogisticRegression 
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.feature_selection import RFE
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


# Import Dataframe using Pandas

**Observation:** The full dataset is large, with over 24,000 rows. Because I will be using an NLP model, I expect there will probably be more than 50,000 columns once the tweets are passed through a count vectorizer, and I do not want to crash my environment. I have therefore decided to start with the first 10,000 rows as my data set.

In [2]:
hate_10k = pd.read_csv('../Data/labeled_data.csv', nrows=10_000)
hate_10k

,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...
...,...,...,...,...,...,...,...
9995,10267,3,0,3,0,1,"I ain't trying to fuck, bitch. I just want wings."
9996,10268,6,0,6,0,1,I aint mad at you bitches thats what hoes do
9997,10269,3,0,3,0,1,"I aint mad at you, thats what hoes do"
9998,10270,3,0,3,0,1,I aint never had a prob with no other bitch ov...
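
Note that `nrows=10_000` keeps the first 10,000 rows of the file rather than a random subset. If the row order in the file happens to be related to the labels, a random sample may be more representative. The following is a minimal sketch on a toy in-memory CSV; the toy data and the `random_state` value are made up for illustration and are not part of the original notebook.

```python
import io
import pandas as pd

# Toy CSV standing in for labeled_data.csv (values are made up).
csv_text = "id,tweet\n" + "\n".join(f"{i},text {i}" for i in range(100))

# nrows keeps the first N data rows of the file.
first_rows = pd.read_csv(io.StringIO(csv_text), nrows=10)

# Alternative: load everything, then draw a reproducible random sample.
full = pd.read_csv(io.StringIO(csv_text))
random_rows = full.sample(n=10, random_state=42).reset_index(drop=True)

print(first_rows["id"].tolist())  # ids 0 through 9
print(len(random_rows))           # 10
```

Sampling this way requires reading the whole file once, which is usually cheap for a CSV of a few tens of thousands of rows; the larger memory cost comes later, from the vectorized matrix.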


**Observation:** Reviewing the data set columns and their purpose, I noticed that the "Unnamed: 0" column and the index do not match. For example, index row 86 has an "Unnamed: 0" value of 87. I therefore decided to drop the "Unnamed: 0" column, which leaves six columns instead of seven.

In [3]:
hate_10k.head(91)

,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...
...,...,...,...,...,...,...,...
86,87,3,0,3,0,1,"""@BrokenPiecesmsc: @ItsNotAdam faggot read my ..."
87,88,3,0,3,0,1,"""@BrosConfessions: This bitch was so ungratefu..."
88,89,3,0,3,0,1,"""@CASHandBOOBIES: I been kidnapped yo bitch"""
89,90,3,3,0,0,0,"""@CB_Baby24: @white_thunduh alsarabsss"" hes a ..."
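
Instead of scanning the printed rows by eye, the mismatch between "Unnamed: 0" and the index can be located programmatically. This is a minimal sketch on a toy frame (the values are made up); on the real data the same comparison would be run against `hate_10k`.

```python
import pandas as pd

# Toy frame: the ID column skips the value 3, as if a row had been removed upstream.
df = pd.DataFrame({"Unnamed: 0": [0, 1, 2, 4, 5], "tweet": list("abcde")})

# Boolean mask of rows where the ID column differs from the positional index.
mismatch = df["Unnamed: 0"] != df.index.to_numpy()

# First index label where the two diverge (None if they never diverge).
first_bad = mismatch.idxmax() if mismatch.any() else None

print(first_bad)            # 3
print(int(mismatch.sum()))  # 2
```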


# Inspection

In [6]:
hate_10k.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Unnamed: 0          10000 non-null  int64 
 1   count               10000 non-null  int64 
 2   hate_speech         10000 non-null  int64 
 3   offensive_language  10000 non-null  int64 
 4   neither             10000 non-null  int64 
 5   class               10000 non-null  int64 
 6   tweet               10000 non-null  object
dtypes: int64(6), object(1)
memory usage: 547.0+ KB


In [7]:
# Checking for duplicates

hate_10k.duplicated().sum()

0
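
One caveat about the check above: if "Unnamed: 0" holds a different value in every row (as the displayed rows suggest), a whole-row `duplicated()` call cannot flag two rows that carry the same tweet text. Restricting the comparison to the `tweet` column is one way to look for repeated tweets. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy frame: two rows share the same tweet text but have different IDs.
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "tweet": ["same text", "same text", "other text"],
})

# Whole-row comparison: the unique ID column makes every row distinct.
print(df.duplicated().sum())                  # 0

# Comparing on the tweet text alone flags the repeated tweet.
print(df.duplicated(subset=["tweet"]).sum())  # 1
```

Whether the real data contains repeated tweets is not known from the outputs shown here; the sketch only illustrates the check.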

In [8]:
# Checking for any 'NaN' values within the data sets. 

hate_10k.isna().sum()

Unnamed: 0            0
count                 0
hate_speech           0
offensive_language    0
neither               0
class                 0
tweet                 0
dtype: int64
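
`isna()` only counts truly missing values. Empty or whitespace-only strings are not reported as missing, so a separate check may be worthwhile for a text column. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy column with one real missing value and two blank strings.
df = pd.DataFrame({"tweet": ["hello", "", "   ", None]})

# Missing values as seen by isna().
n_missing = int(df["tweet"].isna().sum())

# Strings that are empty once surrounding whitespace is stripped.
n_blank = int(df["tweet"].str.strip().eq("").sum())

print(n_missing)  # 1
print(n_blank)    # 2
```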

In [9]:
# Checking whether there are any "[removed]" tweets

(hate_10k['tweet'] == '[removed]').sum()

0

In [10]:
# Checking whether there are any "[deleted]" tweets

(hate_10k['tweet'] == '[deleted]').sum()

0
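
The two placeholder checks above can also be combined into a single pass with `isin`. A minimal sketch on made-up data; the `placeholders` list is a name introduced here for illustration.

```python
import pandas as pd

# Toy column containing both placeholder strings.
df = pd.DataFrame({"tweet": ["hello", "[removed]", "[deleted]", "bye"]})

# Placeholder values to look for (exact matches only).
placeholders = ["[removed]", "[deleted]"]

# True where the tweet is exactly one of the placeholders.
mask = df["tweet"].isin(placeholders)

print(int(mask.sum()))  # 2
```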

In [11]:
# Dropping the Unnamed: 0 column 

hate_10k = hate_10k.drop(columns='Unnamed: 0')
hate_10k.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   count               10000 non-null  int64 
 1   hate_speech         10000 non-null  int64 
 2   offensive_language  10000 non-null  int64 
 3   neither             10000 non-null  int64 
 4   class               10000 non-null  int64 
 5   tweet               10000 non-null  object
dtypes: int64(5), object(1)
memory usage: 468.9+ KB


In [12]:
#Checking to see that the "Unnamed: 0" column has been dropped

hate_10k.head()

,count,hate_speech,offensive_language,neither,class,tweet
0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...
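
As an alternative to dropping the column after loading, it can be skipped while reading the file. `pd.read_csv` accepts a callable for `usecols` that is evaluated against each column name. A minimal sketch on a toy in-memory CSV whose header names the column "Unnamed: 0" explicitly (an assumption made for this example):

```python
import io
import pandas as pd

# Toy CSV with a leftover ID column (values are made up).
csv_text = "Unnamed: 0,count,class,tweet\n0,3,2,first\n1,3,1,second\n"

# Keep only the columns whose name does not start with "Unnamed".
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=lambda name: not name.startswith("Unnamed"),
)

print(df.columns.tolist())  # ['count', 'class', 'tweet']
```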


**Observation:** The "tweet" column contains many exclamation points, @ symbols, and other characters such as colons. To clean the corpus of tweets, I have decided to remove them using the code below.

In [17]:
# Remove the "@" symbol (regex=False treats the pattern as literal text)

hate_10k['tweet'] = hate_10k['tweet'].str.replace('@', "", regex=False)

In [13]:
# Remove the exclamation mark (regex=False treats the pattern as literal text)

hate_10k['tweet'] = hate_10k['tweet'].str.replace('!', "", regex=False)


In [15]:
# Remove the colon (regex=False treats the pattern as literal text)

hate_10k['tweet'] = hate_10k['tweet'].str.replace(':', "", regex=False)

In [18]:
hate_10k.head()

,count,hate_speech,offensive_language,neither,class,tweet
0,3,0,0,3,2,RT mayasolovely As a woman you shouldn't comp...
1,3,0,3,0,1,RT mleew17 boy dats cold...tyga dwn bad for c...
2,3,0,3,0,1,RT UrKindOfBrand Dawg RT 80sbaby4life You eve...
3,3,0,2,1,1,RT C_G_Anderson viva_based she look like a tr...
4,6,0,6,0,1,RT ShenikaRoberts The shit you hear about me ...
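
The three replacements above can also be written as one call using a regular-expression character class. `regex=True` has to be passed explicitly, since `str.replace` treats the pattern as literal text by default from pandas 2.0 onward. A minimal sketch on made-up tweets; the final `str.strip()` is an optional extra step, not something the original notebook does, that removes the leading space left behind once the exclamation marks are gone.

```python
import pandas as pd

# Toy tweets containing the three characters being removed.
tweets = pd.Series(["!!! RT @user: hello", "@a: b!"])

# Remove "@", "!" and ":" in a single pass.
cleaned = tweets.str.replace(r"[@!:]", "", regex=True)

# Optional: trim the whitespace left at the start or end.
trimmed = cleaned.str.strip()

print(cleaned.tolist())  # [' RT user hello', 'a b']
print(trimmed.tolist())  # ['RT user hello', 'a b']
```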


In [20]:
# Save the cleaned data to a CSV file
hate_10k.to_csv('hate10k_clean.csv', index=False)
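
A quick round trip can confirm that the saved file reads back as expected. One thing to watch for: a tweet that is reduced to an empty string by the cleaning steps is written as an empty field and comes back as `NaN` when the file is read again. A minimal sketch using a temporary directory and made-up data (the file name `toy_clean.csv` is hypothetical):

```python
import os
import tempfile
import pandas as pd

# Toy cleaned frame; the second tweet became empty after cleaning.
df = pd.DataFrame({"class": [2, 1], "tweet": ["RT a hello", ""]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_clean.csv")
    df.to_csv(path, index=False)   # index=False avoids writing an extra index column
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())            # ['class', 'tweet']
print(int(reloaded["tweet"].isna().sum()))  # 1
```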