# Data Cleaning
We have two datasets:
* The first, called 'verified', is used to train our models (see data consolidation). It compiles several sources of Twitter accounts labeled as human or bot.
* The second, called 'COVID', is used to test our models. It contains Twitter accounts that posted information about COVID.

In this notebook we clean the data and produce a final version of both datasets.

In [None]:
#Import basic libraries
import pandas as pd
import numpy as np
import pipeline as p
import warnings

warnings.filterwarnings('ignore')

# "Verified" Dataset cleaning
We will explore all the columns of this dataset and, once the cleaning is done, write the result to a new CSV file.

In [None]:
#Read the complete dataset
df = pd.read_csv("C:/Users/Maca/Documents/project_ml/Project-Machine-Learning-CAPP/data_consolidation/consolidated_version2.csv")

In [None]:
#print some rows to see the data
df.head(2)

In [None]:
#Columns of the dataset
df.columns

In [None]:
#Get some stats about the columns
p.describe(df)

In [None]:
#Count the NaN values in each column
for i in df.columns:
    print("Found {} NaN {} records.".format(df[i].isna().sum(), i))
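
The per-column loop above can also be written as a single vectorized call. The following is a minimal sketch on a toy frame; the frame `toy` and its values are hypothetical and only stand in for the real data.

```python
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical values)
toy = pd.DataFrame({
    "bot": ["human", "bot", None],
    "lang": ["en", None, None],
    "followers_count": [10, 25, 3],
})

# isna() marks missing cells; sum() then counts them column by column
nan_counts = toy.isna().sum()
print(nan_counts)
```

The result is a Series indexed by column name, so it can be sorted or filtered (for example `nan_counts[nan_counts > 0]`) to focus on the columns that need attention.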

In [None]:
#Our target is the 'bot' column, so we drop every row where it is NaN
df = df.dropna(subset=['bot'])

## Target - Bot column
The 'bot' column is our target.
For that reason we need to transform it into a dummy (0/1) column: the value 'human' becomes 0 and any other value (that is, 'bot') becomes 1.

In [None]:
#Transform bot column into dummy
df['bot'] = df.loc[:, 'bot'].apply(lambda x: 0 if x == 'human' else 1)
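
Because the lambda above sends every value other than 'human' to 1, a misspelled or unexpected label would silently count as a bot. A possible alternative, sketched here on a hypothetical toy frame and assuming the only valid labels are exactly 'human' and 'bot', is an explicit mapping that fails early on anything else.

```python
import pandas as pd

# Toy labels standing in for the real 'bot' column (hypothetical values)
toy = pd.DataFrame({"bot": ["human", "bot", "bot", "human"]})

# Explicit mapping: a label that is not in the dictionary becomes NaN instead of 1
label_map = {"human": 0, "bot": 1}
encoded = toy["bot"].map(label_map)

# Stop early if an unexpected label slipped through
unexpected = toy.loc[encoded.isna(), "bot"].unique()
assert len(unexpected) == 0, f"Unexpected labels: {unexpected}"

toy["bot"] = encoded.astype(int)
print(toy["bot"].tolist())
```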

## Language Column
The 'lang' column is one of our categorical features. If the language of the Twitter account is English we assign 1, and 0 otherwise.

In [None]:
#If the language is English the attribute is 1, 0 otherwise
df['len_en'] = df.loc[:, 'lang'].apply(lambda x: 1 if x == 'en' else 0)
df = df.drop(['lang'], axis=1)
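
The same flag can be built without `apply`, by comparing the whole column at once. This is a minimal sketch on hypothetical toy data; it is meant to give the same 0/1 values as the cell above, including 0 for missing languages.

```python
import pandas as pd

# Toy language codes standing in for the real 'lang' column (hypothetical values)
toy = pd.DataFrame({"lang": ["en", "es", None, "en"]})

# The comparison yields booleans; a missing value compares as False, so it becomes 0
toy["len_en"] = (toy["lang"] == "en").astype(int)
toy = toy.drop(columns=["lang"])
print(toy["len_en"].tolist())
```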

## Bio Description Column
The 'description' column is one of our categorical features. If the Twitter account has a bio description we assign 1, and 0 otherwise.

In [None]:
#Fill NaN descriptions with 0; has_description is 0 if there is no description, 1 otherwise
df['description'] = df.description.fillna(0)
df['has_description'] = df.loc[:, 'description'].apply(lambda x: 0 if x == 0 else 1)
df = df.drop(['description'], axis=1)
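
The fill-then-compare steps above amount to asking whether the description is present. A minimal sketch of that idea with `notna()`, on hypothetical toy data, could look like this. As in the cell above, an empty string still counts as a description, because only NaN is treated as missing.

```python
import pandas as pd

# Toy bios standing in for the real 'description' column (hypothetical values)
toy = pd.DataFrame({"description": ["ML enthusiast", None, ""]})

# notna() is True wherever a value is present; casting to int gives the 1/0 flag
toy["has_description"] = toy["description"].notna().astype(int)
toy = toy.drop(columns=["description"])
print(toy["has_description"].tolist())
```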

## Finally select only the features for the model

In [None]:
#Keep only the features that we need for the model
verified = df[['bot', 'verified', 'geo_enabled', 'default_profile', 'has_description', 'len_en', 'followers_count',
               'friends_count', 'listed_count', 'favourites_count', 'statuses_count']]
columns = {'followers_count':'followers','friends_count': 'friends', 'favourites_count': 'likes', 'statuses_count': 'tweets'}
verified = verified.rename(columns=columns)

In [None]:
verified.head(2)

In [None]:
verified.to_csv(r'C:\Users\Maca\Documents\project_ml\Project-Machine-Learning-CAPP\Data-Cleaning\verified.csv', index = False)

# "COVID" Dataset cleaning
We will explore all the columns of this dataset and, once the cleaning is done, write the result to a new CSV file.

In [None]:
df2 = pd.read_csv(r'C:\Users\Maca\Documents\project_ml\Project-Machine-Learning-CAPP\covid_user_info.csv')

In [None]:
df2.head(2)

In [None]:
df2.columns

## Features and target of COVID dataset
This dataset is used to test our trained model. It does not contain the target column (the 'bot' or 'human' label), but it has almost all the features of our model.

In [None]:
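#Keep the columns that correspond to features of the model
covid_filtered = df2[['bio', 'followers', 'following', 'location', 'tweets','likes', 'verified']]
#Rename to match the 'verified' dataset; 'location' stands in for 'geo_enabled'
columns2 = {'following': 'friends', 'location': 'geo_enabled'}
covid_filtered = covid_filtered.rename(columns=columns2)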
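
To see exactly which model features the COVID selection lacks, the two column lists can be compared as sets. The sketch below copies the feature names from the cells in this notebook ('bio' is listed under its later name 'has_description'); the variable names are illustrative only.

```python
# Features of the 'verified' dataset after renaming (target 'bot' excluded)
verified_features = ['verified', 'geo_enabled', 'default_profile', 'has_description',
                     'len_en', 'followers', 'friends', 'listed_count', 'likes', 'tweets']

# COVID columns after renaming; 'bio' is later turned into 'has_description'
covid_features = ['has_description', 'followers', 'friends', 'geo_enabled',
                  'tweets', 'likes', 'verified']

# Set differences show what each side lacks
missing = sorted(set(verified_features) - set(covid_features))
extra = sorted(set(covid_features) - set(verified_features))
print("Missing from COVID:", missing)
print("Only in COVID:", extra)
```

Any feature reported as missing has to be either dropped from the model or derived from another COVID column before the trained model can be applied.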

In [None]:
#Get the NaN values
for i in covid_filtered.columns:
    print("Found {} NaN {} records.".format(covid_filtered[i].isna().sum(), i))

### Bio Description Column
The 'bio' column plays the role of the description feature. If the Twitter account has a bio we assign 1, and 0 otherwise.

In [None]:
#Fill NaN bios with 0; has_description is 0 if there is no bio, 1 otherwise
covid_filtered['bio'] = covid_filtered['bio'].fillna(0)
covid_filtered['has_description'] = covid_filtered.loc[:, 'bio'].apply(lambda x: 0 if x == 0 else 1)
covid_filtered = covid_filtered.drop(['bio'], axis=1)

### Geolocation Column
The geolocation column is one of our categorical features. Here the 'location' field (renamed to 'geo_enabled') stands in for it: if a location is present we assign 1, and 0 otherwise.

In [None]:
#Fill NaN locations with 0; geo_enabled is 0 if there is no location, 1 otherwise
covid_filtered['geo_enabled'] = covid_filtered.geo_enabled.fillna(0)
covid_filtered['geo_enabled'] = covid_filtered.loc[:, 'geo_enabled'].apply(lambda x: 0 if x == 0 else 1)

### Categorical column: Verified records
The 'verified' column has 26 null values, so we remove those rows from the dataset.

In [None]:
covid_filtered = covid_filtered[covid_filtered['verified'].notna()]

### Continuous columns: Replace NaNs with median value of the dataset
Our continuous columns are: followers, friends, tweets and likes.

In [None]:
continuous = ['followers', 'friends', 'tweets', 'likes']
d = {}
#NaN counts before filling
print('\033[4m"Data before fillna with median value:"\n\033[0m', covid_filtered[continuous].isna().sum())
#Compute the median of each continuous column (NaNs are skipped by default)
for i in continuous:
    d[i] = covid_filtered[i].median()
print('\n\033[1m\033[92m"Median Values to fill"', d, '\033[0m\n')
#Fill the NaNs of each column with that column's median
covid_filtered = covid_filtered.fillna(value=d)
print('\033[4m\033[94m"Sanity check: Data after fillna with median value"\n\033[0m', covid_filtered[continuous].isna().sum())
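
The effect of the median fill is easiest to see on a small example. The sketch below uses a hypothetical toy frame with two of the continuous columns. It passes the Series of medians straight to `fillna`, which matches values to columns by name, instead of building a dictionary.

```python
import pandas as pd
import numpy as np

# Toy frame standing in for covid_filtered (hypothetical values)
toy = pd.DataFrame({
    "followers": [10.0, np.nan, 30.0, 1000.0],
    "likes": [1.0, 2.0, np.nan, 3.0],
})
continuous = ["followers", "likes"]

# median() skips NaN by default, so each median uses the observed values only
medians = toy[continuous].median()

# fillna with a Series fills each column with the value under its own label
toy[continuous] = toy[continuous].fillna(medians)
print(medians.to_dict())
print(toy)
```

The median is less sensitive than the mean to a few accounts with very large counts, such as the 1000 followers in the toy data, which makes it a reasonable choice for skewed columns like these.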

In [None]:
#Get the NaN values
for i in covid_filtered.columns:
    print("Found {} NaN {} records.".format(covid_filtered[i].isna().sum(), i))

In [None]:
covid_filtered.to_csv(r'C:\Users\Maca\Documents\project_ml\Project-Machine-Learning-CAPP\Data-Cleaning\COVID_tweet.csv', index = False)