# Data Cleaning
We have two datasets:
* One to train our models (see data consolidation), called 'verified'. This dataset compiles different sources of Twitter accounts labeled as human or bot.
* Another to test our models, called 'COVID'. This dataset contains Twitter accounts that posted information about COVID.

In this notebook we will clean the data and produce a final version of these datasets.

In [1]:
#Import basic libraries
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

# "Verified" Dataset cleaning
We will explore all the columns of this dataset and, after the cleaning process, we will have a new CSV file.

In [2]:
#Read the complete dataset
df = pd.read_csv("C:/Users/Maca/Documents/project_ml/Project-Machine-Learning-CAPP/data_consolidation/consolidated_version2.csv")

In [3]:
#print some rows to see the data
df.head(2)

Unnamed: 0.1,Unnamed: 0,id,bot,description,probe_timestamp,created_at,lang,protected,verified,geo_enabled,default_profile,followers_count,friends_count,listed_count,favourites_count,statuses_count,source
0,0,3039154799,human,••TEEN WOLF//SKAM//SHAMELESS••Il mio livello d...,Thu May 16 13:57:12 +0000 2019,Sun Feb 15 14:56:36 +0000 2015,it,False,0.0,0.0,1.0,163,407,0,4193,5761,cresci-rtbust-2019
1,1,390617262,bot,,Tue Apr 16 13:51:17 +0000 2019,Fri Oct 14 08:00:55 +0000 2011,it,False,0.0,0.0,1.0,289,401,1,213,3210,cresci-rtbust-2019


In [4]:
#Columns of the dataset
df.columns

Index(['Unnamed: 0', 'id', 'bot', 'description', 'probe_timestamp',
       'created_at', 'lang', 'protected', 'verified', 'geo_enabled',
       'default_profile', 'followers_count', 'friends_count', 'listed_count',
       'favourites_count', 'statuses_count', 'source'],
      dtype='object')

In [5]:
#Get some stats about the columns
df.describe()

Unnamed: 0.1,Unnamed: 0,id,verified,geo_enabled,default_profile,followers_count,friends_count,listed_count,favourites_count,statuses_count
count,63264.0,63264.0,55721.0,56860.0,56166.0,63264.0,63264.0,63264.0,63264.0,63264.0
mean,20393.214356,7.179809e+17,0.057267,0.155575,0.816704,51401.14,1220.452,160.520407,3030.690709,6684.949
std,16265.554666,4.80538e+17,0.232355,0.362455,0.386912,1389009.0,19959.91,3696.298209,15509.481248,39389.06
min,0.0,586.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3273.0,2369773000.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,11.0
50%,18905.5,1.050035e+18,0.0,0.0,1.0,2.0,36.0,0.0,0.0,45.0
75%,34721.25,1.056234e+18,0.0,0.0,1.0,100.0,267.25,1.0,125.0,318.25
max,50537.0,1.079456e+18,1.0,1.0,1.0,106938000.0,2141379.0,606500.0,886115.0,2766520.0


In [6]:
#Find the NaN values for each column
for i in df.columns:
    print("Found {} NaN {} records.".format(df[i].isna().sum(), i))

Found 0 NaN Unnamed: 0 records.
Found 0 NaN id records.
Found 7543 NaN bot records.
Found 32995 NaN description records.
Found 0 NaN probe_timestamp records.
Found 0 NaN created_at records.
Found 2987 NaN lang records.
Found 11086 NaN protected records.
Found 7543 NaN verified records.
Found 6404 NaN geo_enabled records.
Found 7098 NaN default_profile records.
Found 0 NaN followers_count records.
Found 0 NaN friends_count records.
Found 0 NaN listed_count records.
Found 0 NaN favourites_count records.
Found 0 NaN statuses_count records.
Found 0 NaN source records.
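
The same per-column counts can also be obtained without an explicit loop. The following is a minimal sketch on a small made-up frame (the `toy` data is hypothetical, not taken from the dataset); `isna().sum()` returns one count per column as a Series.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "bot": ["human", np.nan, "bot"],
    "lang": ["it", np.nan, np.nan],
    "followers_count": [163, 289, 12],
})

# One NaN count per column, returned as a Series indexed by column name
nan_counts = toy.isna().sum()
print(nan_counts)

# Share of missing values per column, which can help decide what to drop
nan_share = toy.isna().mean()
print(nan_share.round(2))
```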


In [7]:
#Our target is the 'bot' column, so we drop all the rows where 'bot' is NaN
df = df.dropna(subset=['bot'])

## Target - Bot column
The bot column is our target column.
For that reason we need to transform it into a dummy column. If the 'bot' column has the value bot we will assign 1, and 0 otherwise (value human).

In [8]:
#Transform bot column into dummy
df['bot'] = df.loc[:, 'bot'].apply(lambda x: 0 if x == 'human' else 1)
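
One thing to keep in mind with the lambda above is that any value other than 'human' becomes 1, including a misspelled or unexpected label. A possible alternative, sketched here on made-up values (the `labels` Series and the 'unknown' entry are hypothetical), is an explicit mapping, so that unexpected labels show up as NaN instead of being silently counted as bots.

```python
import pandas as pd

# Hypothetical labels; 'unknown' stands for any unexpected value
labels = pd.Series(["human", "bot", "unknown", "bot"])

# Explicit mapping: any value not listed in the dict becomes NaN
encoded = labels.map({"human": 0, "bot": 1})
print(encoded)

# Unexpected labels can be inspected before deciding how to treat them
unexpected = labels[encoded.isna()].unique()
print(unexpected)
```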

## Language Column
The language column is one of our categorical features. If the language of the Twitter account is English we will assign 1, and 0 otherwise.

In [9]:
#If the language is English the attribute is 1, 0 otherwise
df['len_en'] = df.loc[:, 'lang'].apply(lambda x: 1 if x == 'en' else 0)
df = df.drop(['lang'], axis=1)
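
The same flag can be built with a vectorized comparison instead of `apply`. This is a minimal sketch on made-up values (the `toy` frame is hypothetical). As with the lambda above, accounts with a missing language end up as 0, because comparing NaN with 'en' is False.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing language
toy = pd.DataFrame({"lang": ["en", "it", np.nan, "en"]})

# The comparison yields booleans; astype(int) turns them into 1/0
toy["len_en"] = (toy["lang"] == "en").astype(int)
toy = toy.drop(columns=["lang"])
print(toy)
```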

## Bio Description Column
The description column is one of our categorical features. If the Twitter account has a bio description we will assign 1, and 0 otherwise.

In [10]:
#Fill NaN descriptions with 0; has_description is 0 if there is no description, 1 otherwise
df['description'] = df.description.fillna(0)
df['has_description'] = df.loc[:, 'description'].apply(lambda x: 0 if x == 0 else 1)
df = df.drop(['description'], axis=1)
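
The fill-then-compare steps above can also be written as a single presence check. A minimal sketch, assuming a missing bio is stored as NaN (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing description
toy = pd.DataFrame({"description": ["Some bio", np.nan, "Another bio"]})

# notna() is True where a description exists; astype(int) gives 1/0
toy["has_description"] = toy["description"].notna().astype(int)
toy = toy.drop(columns=["description"])
print(toy)
```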

## Finally select only the features for the model

In [11]:
#Keep only the features that we need for the model
verified = df[['bot', 'verified', 'geo_enabled', 'default_profile', 'has_description', 'len_en', 'followers_count',
               'friends_count', 'listed_count', 'favourites_count', 'statuses_count']]
columns = {'followers_count':'followers','friends_count': 'friends', 'favourites_count': 'likes', 'statuses_count': 'tweets'}
verified = verified.rename(columns=columns)

In [12]:
verified.head(2)

Unnamed: 0,bot,verified,geo_enabled,default_profile,has_description,len_en,followers,friends,listed_count,likes,tweets
0,0,0.0,0.0,1.0,1,0,163,407,0,4193,5761
1,1,0.0,0.0,1.0,0,0,289,401,1,213,3210


In [13]:
verified.to_csv(r'C:\Users\Maca\Documents\project_ml\Project-Machine-Learning-CAPP\Data-Cleaning\verified.csv', index = False)
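
The absolute Windows paths used in this notebook only resolve on the original machine. Below is a minimal sketch of a more portable variant built with `pathlib`. The `Data-Cleaning` folder name is taken from the path above; the temporary directory only stands in for the project root so that the example is self-contained, and the two-row frame is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the project root; in the notebook this could be Path.cwd() (assumption)
project_root = Path(tempfile.mkdtemp())

# Build the output path from parts instead of hard-coding separators
out_dir = project_root / "Data-Cleaning"
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / "verified.csv"

# Hypothetical two-row frame standing in for `verified`
toy = pd.DataFrame({"bot": [0, 1], "followers": [163, 289]})
toy.to_csv(out_path, index=False)

# Reading the file back confirms the round trip
print(pd.read_csv(out_path))
```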

# "COVID" Dataset cleaning
We will explore all the columns of this dataset and, after the cleaning process, we will have a new CSV file.

In [14]:
df2 = pd.read_csv(r'C:\Users\Maca\Documents\project_ml\Project-Machine-Learning-CAPP\covid_user_info.csv')

In [15]:
df2.head(2)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,avatar,background_image,bio,followers,following,user_id,join_date,join_datetime,...,likes,location,media,name,private,tweets,url,username,verified,retrieved_info
0,0,0.0,https://pbs.twimg.com/profile_images/125764305...,https://pbs.twimg.com/profile_banners/44728980...,News updates & breaking news from the Philippi...,4955553.0,771.0,44728980.0,4-Jun-09,6/4/2009 14:26,...,5444.0,Philippines,128000,ABS-CBN News Channel,0.0,719761.0,http://news.abs-cbn.com/anc,ANCALERTS,1.0,retrieved_info
1,1,1.0,https://pbs.twimg.com/profile_images/118518765...,https://pbs.twimg.com/profile_banners/11851227...,Turkey's own independent gazette,16092.0,1.0,1.19e+18,18-Oct-19,10/18/2019 2:17,...,10.0,,3448,Duvar English,0.0,7794.0,http://www.duvarenglish.com,DuvarEnglish,0.0,retrieved_info


In [16]:
df2.columns

Index(['Unnamed: 0', 'Unnamed: 0.1', 'avatar', 'background_image', 'bio',
       'followers', 'following', 'user_id', 'join_date', 'join_datetime',
       'join_time', 'likes', 'location', 'media', 'name', 'private', 'tweets',
       'url', 'username', 'verified', 'retrieved_info'],
      dtype='object')

## Features and target of COVID dataset
The purpose of this dataset is to test our trained model. It doesn't contain the target column (accounts labeled as 'bot' or 'human'), but it has almost all the features of our model.

In [17]:
covid_filtered = df2[['username', 'avatar','bio', 'followers', 'following', 'location', 'tweets','likes', 'verified']]
columns2 = {'following': 'friends', 'location': 'geo_enabled'}
covid_filtered = covid_filtered.rename(columns=columns2)
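
To make "almost all the features" concrete, the column names of the two cleaned datasets can be compared as sets. The names below are copied from the cells of this notebook (the final 'verified' columns and the COVID columns after cleaning); the variable names are only illustrative.

```python
# Feature columns of the cleaned 'verified' dataset (target 'bot' excluded)
verified_features = {
    "verified", "geo_enabled", "default_profile", "has_description", "len_en",
    "followers", "friends", "listed_count", "likes", "tweets",
}

# Columns of the cleaned COVID dataset
covid_columns = {
    "username", "avatar", "followers", "friends", "geo_enabled",
    "tweets", "likes", "verified", "has_description",
}

# Features the model was trained on that the COVID dataset lacks
missing_in_covid = sorted(verified_features - covid_columns)
# COVID columns that are identifiers rather than model features
extra_in_covid = sorted(covid_columns - verified_features)

print(missing_in_covid)
print(extra_in_covid)
```

A model trained on 'verified' would presumably need to be restricted to the shared features, or the missing ones would need to be obtained some other way, before it can score the COVID accounts.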

In [18]:
#Get the NaN values
for i in covid_filtered.columns:
    print("Found {} NaN {} records.".format(covid_filtered[i].isna().sum(), i))

Found 26 NaN username records.
Found 6 NaN avatar records.
Found 1124 NaN bio records.
Found 16 NaN followers records.
Found 16 NaN friends records.
Found 4620 NaN geo_enabled records.
Found 26 NaN tweets records.
Found 16 NaN likes records.
Found 26 NaN verified records.


### Bio Description Column
The bio column is one of our categorical features. If the Twitter account has a bio we will assign 1, and 0 otherwise.

In [19]:
#Fill NaN bios with 0; has_description is 0 if there is no bio, 1 otherwise
covid_filtered['bio'] = covid_filtered.bio.fillna(0)
covid_filtered['has_description'] = covid_filtered.loc[:, 'bio'].apply(lambda x: 0 if x == 0 else 1)
covid_filtered = covid_filtered.drop(['bio'], axis=1)

### Geolocation Column
Geolocation is one of our categorical features. The COVID dataset has no geo_enabled column, so we use the location column (renamed to geo_enabled above) as a proxy: if the account has a location we will assign 1, and 0 otherwise.

In [20]:
#Fill NaN locations with 0; geo_enabled is 0 if there is no location, 1 otherwise
covid_filtered['geo_enabled'] = covid_filtered.geo_enabled.fillna(0)
covid_filtered['geo_enabled'] = covid_filtered.loc[:, 'geo_enabled'].apply(lambda x: 0 if x == 0 else 1)
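
Since the flag only records whether a location is present, the two steps above can be collapsed into one presence check. A minimal sketch on made-up values (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing location
toy = pd.DataFrame({"location": ["Philippines", np.nan, "Chicago"]})

# 1 where a location string exists, 0 where it is NaN
toy["geo_enabled"] = toy["location"].notna().astype(int)
print(toy)
```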

### Categorical column: Verified records
As we have 26 null values in the verified column, we will remove those rows from the dataset.

In [21]:
covid_filtered = covid_filtered[covid_filtered['verified'].notna()]

In [22]:
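#Fill any remaining NaN values in the continuous columns with the column median
#(the counts below show that none remain after dropping the rows without 'verified')
continuous = ['followers', 'friends', 'tweets', 'likes']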
d = {}
print('\033[4m"Data before fillna with median value:"\n\x1b[0m', covid_filtered[continuous].isna().sum())
for i in continuous:
    d[i] = covid_filtered[i].median()
print('\n\033[1m\33[92m"Median Values to fill"', d, '\x1b[0m\n')
covid_filtered = covid_filtered.fillna(value=d)
print('\033[4m\033[94m"Sanity check: Data after fillna with median value"\n\x1b[0m', covid_filtered[continuous].isna().sum())

"Data before fillna with median value:"
 followers    0
friends      0
tweets       0
likes        0
dtype: int64

"Median Values to fill" {'followers': 527.0, 'friends': 599.0, 'tweets': 5144.0, 'likes': 3193.0}

"Sanity check: Data after fillna with median value"
 followers    0
friends      0
tweets       0
likes        0
dtype: int64
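
Because the real columns contain no missing values at this point, the effect of the median fill is not visible in the output above. The following minimal sketch shows it on made-up numbers (the `toy` frame is hypothetical). The median is a common choice for count columns like these because a few very large accounts do not shift it the way they would shift the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap per column and one very large account
toy = pd.DataFrame({
    "followers": [10, np.nan, 30, 1000],
    "likes": [1, 2, np.nan, 3],
})
continuous = ["followers", "likes"]

# Column-wise medians, computed while ignoring NaN values
medians = toy[continuous].median()
print(medians)

# Passing a Series to fillna fills each column with its own median
filled = toy.fillna(medians)
print(filled)
```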


In [23]:
#Get the NaN values
for i in covid_filtered.columns:
    print("Found {} NaN {} records.".format(covid_filtered[i].isna().sum(), i))

Found 0 NaN username records.
Found 0 NaN avatar records.
Found 0 NaN followers records.
Found 0 NaN friends records.
Found 0 NaN geo_enabled records.
Found 0 NaN tweets records.
Found 0 NaN likes records.
Found 0 NaN verified records.
Found 0 NaN has_description records.


In [24]:
covid_filtered.head()

Unnamed: 0,username,avatar,followers,friends,geo_enabled,tweets,likes,verified,has_description
0,ANCALERTS,https://pbs.twimg.com/profile_images/125764305...,4955553.0,771.0,1,719761.0,5444.0,1.0,1
1,DuvarEnglish,https://pbs.twimg.com/profile_images/118518765...,16092.0,1.0,0,7794.0,10.0,0.0,1
2,KTVOTV,https://pbs.twimg.com/profile_images/976217894...,9187.0,983.0,1,97617.0,165.0,1.0,1
3,YourMorning,https://pbs.twimg.com/profile_images/102135283...,19848.0,2522.0,0,52947.0,19054.0,1.0,1
4,CodaStory,https://pbs.twimg.com/profile_images/113373380...,12321.0,965.0,1,7580.0,2672.0,0.0,1


## Save the COVID dataset to a CSV file

In [25]:
covid_filtered.to_csv(r'C:\Users\Maca\Documents\project_ml\Project-Machine-Learning-CAPP\Data-Cleaning\COVID_tweet.csv', index = False)