# DS320 Spring 2023: exercise 03
<b>Posted on Thu 03/16/23, due on Tue 03/28/23 at 8:00 AM</b><br>

The focus of this assignment is <b>data cleaning</b> and <b>preparation: dealing with numbers, datetime values, and text</b><br>

<i>Before working on this assignment, review the data cleaning and preparation pipeline I drew on the whiteboard and the "data cleaning" slide (make sure you are clear about the 5 steps there)</i>. 

You will clean tweets taken from https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment
for a text sentiment analysis problem.

The tweets are included here, but I highly recommend visiting the site above for more information about the dataset.

There are 10 tasks, each worth 10 points.

Note: I will manually grade your code, so no test cases will be provided, but I can give you the expected outcome for each task.

In [37]:
import pandas as pd
import numpy as np

# Task 01: understand the data set
- Read the whole tweet dataset (in the Tweets.csv file) associated with this assignment into a dataframe. Name it `df_tweets`.
From now on, you will work on `df_tweets`.
- Find out the following information. You must show your code:
    - Shape
    - Data types of all columns
    - Numerical columns
    - Text columns
    - Categorical columns
    - Date/time columns
    - Statistics (min, max, mean, std, ...) of all the numerical columns

Expectation: df_tweets.shape = (14640, 15)
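
One possible way to split columns into these groups is `select_dtypes`. The sketch below runs on a small made-up dataframe (the column names and values are illustrative, not the real Tweets.csv). Text, categorical, and unparsed date/time columns all come back as non-numeric, so they still need to be told apart by inspecting the values; counting distinct values is one simple heuristic.

```python
import pandas as pd

# Hypothetical toy data standing in for the tweet dataframe.
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "confidence": [1.0, 0.35, 0.68],
    "airline": ["Delta", "United", "Delta"],
    "created": ["2015-02-24 11:35:52", "2015-02-24 11:15:59", "2015-02-24 11:15:48"],
})

# Numeric columns versus everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Few distinct values relative to the row count hints at a categorical column;
# all-distinct values hint at free text, identifiers, or timestamps.
unique_counts = toy[non_numeric_cols].nunique()

print(numeric_cols)
print(unique_counts)
```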

In [38]:
###BEGIN SOLUTION
df_tweets = pd.read_csv("./Tweets.csv") 
df_original_tweets = df_tweets.copy()
print(df_tweets.shape)
print(df_tweets.dtypes)
df_tweets.describe()
###END SOLUTION

(14640, 15)
tweet_id                          int64
airline_sentiment                object
airline_sentiment_confidence    float64
negativereason                   object
negativereason_confidence       float64
airline                          object
airline_sentiment_gold           object
name                             object
negativereason_gold              object
retweet_count                     int64
text                             object
tweet_coord                      object
tweet_created                    object
tweet_location                   object
user_timezone                    object
dtype: object


| | tweet_id | airline_sentiment_confidence | negativereason_confidence | retweet_count |
|---|---|---|---|---|
| count | 14640.0 | 14640.0 | 10522.0 | 14640.0 |
| mean | 5.692184e+17 | 0.900169 | 0.638298 | 0.08265 |
| std | 7.791112e+14 | 0.16283 | 0.33044 | 0.745778 |
| min | 5.675883e+17 | 0.335 | 0.0 | 0.0 |
| 25% | 5.685592e+17 | 0.6923 | 0.3606 | 0.0 |
| 50% | 5.694779e+17 | 1.0 | 0.6706 | 0.0 |
| 75% | 5.698905e+17 | 1.0 | 1.0 | 0.0 |
| max | 5.703106e+17 | 1.0 | 1.0 | 44.0 |


# Task 02: missing values in columns (1)
Create a function, `get_missing_percent_in_cols`, that takes a dataframe as an input and returns a dataframe with 3 columns: the column names (name it `column_name`), the percentage of missing values in each column (name it `percent_missing`), and the data type of each column (name it `data_type`) in the given dataframe. The returned dataframe needs to be sorted in descending order of the percentages.

Call the function with `df_tweets`


In [39]:

def get_missing_percent_in_cols(df):
###BEGIN SOLUTION
    percent_missing = df.isnull().sum() * 100 / df.shape[0]
    df_cols = pd.DataFrame(data = {'column_name': df.columns,
                                   'percent_missing': percent_missing,
                                   'data_type': df.dtypes})

    df_cols.sort_values(['percent_missing'], ascending = False, inplace = True)
    return df_cols
###END SOLUTION

#usage
df_cols = get_missing_percent_in_cols(df_tweets)
df_cols


| | column_name | percent_missing | data_type |
|---|---|---|---|
| negativereason_gold | negativereason_gold | 99.781421 | object |
| airline_sentiment_gold | airline_sentiment_gold | 99.726776 | object |
| tweet_coord | tweet_coord | 93.039617 | object |
| negativereason | negativereason | 37.308743 | object |
| user_timezone | user_timezone | 32.923497 | object |
| tweet_location | tweet_location | 32.329235 | object |
| negativereason_confidence | negativereason_confidence | 28.128415 | float64 |
| tweet_id | tweet_id | 0.0 | int64 |
| airline_sentiment | airline_sentiment | 0.0 | object |
| airline_sentiment_confidence | airline_sentiment_confidence | 0.0 | float64 |


# Task 03: missing values in columns (2)
Create a function, `delete_cols`, that takes a dataframe `df` and a number `threshold` as inputs. The function needs to return a dataframe containing all columns of the input dataframe except those whose percentage of missing values is greater than `threshold`.

Call the function on `df_tweets` with a threshold of `97`.
 
Expectation: the shape of the returned dataframe = (14640, 13)
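
To make the threshold rule concrete, here is a minimal sketch on made-up data. It uses a boolean mask over the columns rather than the helper from Task 02, and the toy column names and the threshold of 50 are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" is 75% missing, column "c" is 25% missing.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [np.nan, 2.0, 3.0, 4.0],
})

threshold = 50
missing_percent = toy.isna().mean() * 100

# Keep only the columns whose missing percentage does not exceed the threshold.
kept = toy.loc[:, missing_percent <= threshold]
print(kept.columns.tolist())
```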

In [40]:
def delete_cols(df, threshold):
###BEGIN SOLUTION
    df_cols = get_missing_percent_in_cols(df)
    col_to_del = df_cols[df_cols['percent_missing'] > threshold]['column_name']
    return df.drop(col_to_del, axis='columns', inplace = False)
###END SOLUTION

#usage
df_tweets = delete_cols(df_tweets, 97)
print(df_tweets.shape)


(14640, 13)


# Task 04: missing values in rows (3): Imputation
Create a function, `impute_by_mean`, that takes a dataframe as an input. The function needs to return a dataframe constructed from the input dataframe as follows: for each numerical column, replace all the missing values there with the mean of that column. 

Call the function to impute numerical missing values for `df_tweets`

Expectation: one column will be imputed by the mean of that column
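
As a rough illustration of mean imputation, the sketch below fills a made-up dataframe in a single call. It relies on `fillna` accepting a Series of per-column fill values; the toy columns are hypothetical, and this is one possible approach rather than the required one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric and one text column.
toy = pd.DataFrame({
    "score": [1.0, np.nan, 3.0],
    "label": ["x", None, "z"],
})

# mean(numeric_only=True) returns one mean per numeric column;
# passing that Series to fillna fills each numeric column with its own mean
# and leaves the non-numeric columns untouched.
imputed = toy.fillna(toy.mean(numeric_only=True))
print(imputed)
```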
 

In [41]:
def impute_by_mean(df):
###BEGIN SOLUTION
    df_cols = get_missing_percent_in_cols(df)
    col_to_impu = df_cols[(df_cols['percent_missing'] > 0)  &
                      ((df_cols['data_type'] == 'int64') |
                      (df_cols['data_type'] == 'float64'))]['column_name']
    df1 = df.copy()
    df1[col_to_impu] = df1[col_to_impu].fillna(value = df1[col_to_impu].mean(skipna=True, axis = 'rows'),
                                               axis = 'rows', inplace = False)
    
    return df1
###END SOLUTION

#usage
df_tweets = impute_by_mean(df_tweets)
df = get_missing_percent_in_cols(df_tweets)
df


| | column_name | percent_missing | data_type |
|---|---|---|---|
| tweet_coord | tweet_coord | 93.039617 | object |
| negativereason | negativereason | 37.308743 | object |
| user_timezone | user_timezone | 32.923497 | object |
| tweet_location | tweet_location | 32.329235 | object |
| tweet_id | tweet_id | 0.0 | int64 |
| airline_sentiment | airline_sentiment | 0.0 | object |
| airline_sentiment_confidence | airline_sentiment_confidence | 0.0 | float64 |
| negativereason_confidence | negativereason_confidence | 0.0 | float64 |
| airline | airline | 0.0 | object |
| name | name | 0.0 | object |


# Task 05: Deal with datetime data type 
Create a function, `decompose_datetime`, that takes a series of datetime values as an input. The function needs to return a dataframe with 6 columns, namely "year", "month", "day", "hour", "minute", and "second", containing the year, month, day, hour, minute, and second of the input datetime values respectively.

Call the function to decompose the `tweet_created` column in `df_tweets` and then replace that column with the returned dataframe.

Hints: you need to convert `tweet_created` to the datetime data type first.
You can let pandas detect the datetime format, or you can specify the format to speed up the conversion.

Expectation: df_tweets.shape = (14640, 18)
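
The hint about specifying the format can be sketched as follows. The sample strings and the format string are assumptions made for illustration; check a few raw `tweet_created` values in Tweets.csv before relying on this layout.

```python
import pandas as pd

# Made-up sample timestamps; the "date time offset" layout is an assumption.
raw = pd.Series(["2015-02-24 11:35:52 -0800", "2015-02-24 11:15:59 -0800"])

# An explicit format lets pandas skip per-value format inference.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S %z")

# The .dt accessor exposes the individual components of each timestamp.
parts = pd.DataFrame({
    "year": parsed.dt.year,
    "hour": parsed.dt.hour,
    "second": parsed.dt.second,
})
print(parts)
```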


In [42]:
def decompose_datetime(s):
###BEGIN SOLUTION
    df = pd.DataFrame(data = {'year': s.dt.year,
                             'month': s.dt.month,
                             'day': s.dt.day,
                             'hour': s.dt.hour,
                             'minute':s.dt.minute,
                             'second': s.dt.second
                             })
    return df
df_tweets['tweet_created'] = pd.to_datetime(df_tweets['tweet_created'])
df = decompose_datetime(df_tweets['tweet_created'])
df_tweets.drop(['tweet_created'], axis = 'columns', inplace = True)
for c in df.columns:
    df_tweets[c] = df[c]
print(df_tweets.shape)
df_tweets.head()
###END SOLUTION


(14640, 18)


Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,name,retweet_count,text,tweet_coord,tweet_location,user_timezone,year,month,day,hour,minute,second
0,570306133677760513,neutral,1.0,,0.638298,Virgin America,cairdin,0,@VirginAmerica What @dhepburn said.,,,Eastern Time (US & Canada),2015,2,24,11,35,52
1,570301130888122368,positive,0.3486,,0.0,Virgin America,jnardino,0,@VirginAmerica plus you've added commercials t...,,,Pacific Time (US & Canada),2015,2,24,11,15,59
2,570301083672813571,neutral,0.6837,,0.638298,Virgin America,yvonnalynn,0,@VirginAmerica I didn't today... Must mean I n...,,Lets Play,Central Time (US & Canada),2015,2,24,11,15,48
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,jnardino,0,@VirginAmerica it's really aggressive to blast...,,,Pacific Time (US & Canada),2015,2,24,11,15,36
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,jnardino,0,@VirginAmerica and it's a really big bad thing...,,,Pacific Time (US & Canada),2015,2,24,11,14,45


In [43]:
# Run this cell for the text handling tasks
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.porter import PorterStemmer
import nltk
import re
import sys
import warnings

nltk.download('stopwords')
stop_words = stopwords.words('english')
porter = PorterStemmer()

if not sys.warnoptions:
    warnings.simplefilter("ignore")
    
def cleanHtml(sentence):
    #cleanr = re.compile('<.*?>')
    cleanr = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
    cleantext = re.sub(cleanr, ' ', str(sentence))
    return cleantext


def cleanPunc(sentence): #function to clean the word of any punctuation or special characters
    cleaned = re.sub(r'[?|!|\'|"|#]',r' ',sentence)
    cleaned = re.sub(r'[.|,|:|;|)|(|\|/]',r' ',cleaned)
    cleaned = cleaned.strip()
    cleaned = cleaned.replace("\n"," ")
    return cleaned


def keepAlpha(sentence):
    alpha_sent = ""
    for word in sentence.split():
        alpha_word = re.sub('[^a-z A-Z]+', ' ', word)
        alpha_sent += alpha_word
        alpha_sent += " "
    alpha_sent = alpha_sent.strip()
    return alpha_sent

#stop words removing 

stop_words = set(stopwords.words('english'))
stop_words.update(['  ', 'zero','one','two','three','four','five','six','seven','eight','nine','ten','may','also','across','among','beside','however','yet','within'])
re_stop_words = re.compile(r"\b(" + "|".join(stop_words) + ")\\W", re.I)

def removeStopWords(sentence):
    global re_stop_words
    return re_stop_words.sub(" ", sentence)


# stemming

stemmer = SnowballStemmer("english")
def stemming(sentence):
    stemSentence = ""
    for word in sentence.split():
        stem = stemmer.stem(word)
        stemSentence += stem
        stemSentence += " "
    stemSentence = stemSentence.strip()
    return stemSentence


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/LC/doth02/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Task 06 Handle text: lower
Create a function, `lower_text`, that takes a dataframe as an input. The function needs to return a dataframe constructed from the given one by lowercasing all the text columns in that dataframe.

Call the function on `df_tweets`.
Requirement: you must use `applymap` and `lambda`.
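
For reference, a minimal sketch of element-wise lowercasing on a made-up dataframe. In pandas 2.1 and later, `applymap` is deprecated in favor of `DataFrame.map`, so the sketch picks whichever is available. Unlike a version that turns missing values into empty strings, this one leaves non-string values as they are; that choice is an assumption of the sketch, not a requirement of the task.

```python
import pandas as pd

# Hypothetical toy data with one missing value.
toy = pd.DataFrame({
    "airline": ["Delta", None],
    "text": ["  HELLO There", "Late AGAIN"],
})

# DataFrame.map replaced applymap in pandas 2.1; fall back on older versions.
elementwise = toy.map if hasattr(toy, "map") else toy.applymap

# The lambda runs once per cell; non-string cells pass through unchanged.
lowered = elementwise(lambda x: x.lower() if isinstance(x, str) else x)
print(lowered)
```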

In [44]:
def lower_text(df):
###BEGIN SOLUTION
    df1 = df.select_dtypes(include='object')
    print(df1.dtypes)
    df1 = df1.applymap(lambda x: x.lower() if isinstance(x, str) else "")
    df_cp = df.copy()
    for c in df1.columns:
        df_cp[c] = df1[c]
    return df_cp
###END SOLUTION

#usage
df_tweets = lower_text(df_tweets)
df_tweets.head()

airline_sentiment    object
negativereason       object
airline              object
name                 object
text                 object
tweet_coord          object
tweet_location       object
user_timezone        object
dtype: object


Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,name,retweet_count,text,tweet_coord,tweet_location,user_timezone,year,month,day,hour,minute,second
0,570306133677760513,neutral,1.0,,0.638298,virgin america,cairdin,0,@virginamerica what @dhepburn said.,,,eastern time (us & canada),2015,2,24,11,35,52
1,570301130888122368,positive,0.3486,,0.0,virgin america,jnardino,0,@virginamerica plus you've added commercials t...,,,pacific time (us & canada),2015,2,24,11,15,59
2,570301083672813571,neutral,0.6837,,0.638298,virgin america,yvonnalynn,0,@virginamerica i didn't today... must mean i n...,,lets play,central time (us & canada),2015,2,24,11,15,48
3,570301031407624196,negative,1.0,bad flight,0.7033,virgin america,jnardino,0,@virginamerica it's really aggressive to blast...,,,pacific time (us & canada),2015,2,24,11,15,36
4,570300817074462722,negative,1.0,can't tell,1.0,virgin america,jnardino,0,@virginamerica and it's a really big bad thing...,,,pacific time (us & canada),2015,2,24,11,14,45


# Task 07 Handle text: strip
Create a function, `strip_text`, that takes a dataframe as an input. The function needs to return a dataframe constructed from the given one by stripping all leading and trailing whitespace from every text value in that dataframe.

Call the function on `df_tweets`.
Requirement: you must use `apply` and `lambda`.
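
The difference from the previous task is that `apply` hands the lambda a whole column at a time rather than a single cell. A small sketch on made-up data, using the `.str` accessor inside the lambda (the toy columns are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: one text column with stray spaces, one numeric column.
toy = pd.DataFrame({"airline": [" delta ", "united"], "retweets": [0, 3]})

text_cols = toy.select_dtypes(exclude="number").columns
stripped = toy.copy()

# apply passes each column (a Series) to the lambda;
# .str.strip() trims leading and trailing whitespace from every value.
stripped[text_cols] = toy[text_cols].apply(lambda col: col.str.strip())
print(stripped)
```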

In [45]:
def strip_text(df):
###BEGIN SOLUTION
    df1 = df.select_dtypes(include='object')
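    # apply works column by column, as the task requires; map then handles each value
    df1 = df1.apply(lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else ""))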
    df_cp = df.copy()
    for c in df1.columns:
        df_cp[c] = df1[c]
    return df_cp
###END SOLUTION
#usage
df_tweets = strip_text(df_tweets)
df_tweets.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,name,retweet_count,text,tweet_coord,tweet_location,user_timezone,year,month,day,hour,minute,second
0,570306133677760513,neutral,1.0,,0.638298,virgin america,cairdin,0,@virginamerica what @dhepburn said.,,,eastern time (us & canada),2015,2,24,11,35,52
1,570301130888122368,positive,0.3486,,0.0,virgin america,jnardino,0,@virginamerica plus you've added commercials t...,,,pacific time (us & canada),2015,2,24,11,15,59
2,570301083672813571,neutral,0.6837,,0.638298,virgin america,yvonnalynn,0,@virginamerica i didn't today... must mean i n...,,lets play,central time (us & canada),2015,2,24,11,15,48
3,570301031407624196,negative,1.0,bad flight,0.7033,virgin america,jnardino,0,@virginamerica it's really aggressive to blast...,,,pacific time (us & canada),2015,2,24,11,15,36
4,570300817074462722,negative,1.0,can't tell,1.0,virgin america,jnardino,0,@virginamerica and it's a really big bad thing...,,,pacific time (us & canada),2015,2,24,11,14,45


# Task 08: Handle text: remove html tags and entities

Use the `cleanHtml()` function provided to remove all html tags or entities in the `text` column
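
To see what the pattern inside `cleanHtml()` matches, the sketch below applies the same regular expression to a made-up string. Note that the entity part only accepts lowercase letters, which fits here because the text has already been lowercased in Task 06.

```python
import re

# Same pattern as in cleanHtml: an HTML tag, or a named, numeric, or hex entity.
pattern = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

# Made-up sample containing a named entity, a tag pair, and numeric entities.
sample = "thanks &amp; <b>great</b> flight &#39;today&#39;"
cleaned = re.sub(pattern, " ", sample)
print(cleaned.split())

# An upper-case entity does not match the lowercase-only character class.
untouched = re.sub(pattern, " ", "A &AMP; B")
print(untouched)
```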


In [46]:
###BEGIN SOLUTION
df_tweets['text'] = df_tweets['text'].apply(cleanHtml)
df_tweets.head()
###END SOLUTION

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,name,retweet_count,text,tweet_coord,tweet_location,user_timezone,year,month,day,hour,minute,second
0,570306133677760513,neutral,1.0,,0.638298,virgin america,cairdin,0,@virginamerica what @dhepburn said.,,,eastern time (us & canada),2015,2,24,11,35,52
1,570301130888122368,positive,0.3486,,0.0,virgin america,jnardino,0,@virginamerica plus you've added commercials t...,,,pacific time (us & canada),2015,2,24,11,15,59
2,570301083672813571,neutral,0.6837,,0.638298,virgin america,yvonnalynn,0,@virginamerica i didn't today... must mean i n...,,lets play,central time (us & canada),2015,2,24,11,15,48
3,570301031407624196,negative,1.0,bad flight,0.7033,virgin america,jnardino,0,@virginamerica it's really aggressive to blast...,,,pacific time (us & canada),2015,2,24,11,15,36
4,570300817074462722,negative,1.0,can't tell,1.0,virgin america,jnardino,0,@virginamerica and it's a really big bad thing...,,,pacific time (us & canada),2015,2,24,11,14,45


# Task 09: Handle text: remove punctuation

Use the `cleanPunc` function provided to remove all punctuation and special characters in the `text` column
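
One detail worth knowing about the character classes in `cleanPunc`: inside square brackets, `|` is an ordinary character rather than an alternation operator, so pipe characters are replaced along with the listed punctuation. A short sketch using the first class from the helper on a made-up string:

```python
import re

# Made-up sample with several of the targeted characters and a pipe.
sample = 'late again?! #fail "really" a|b'

# First character class from cleanPunc; the | inside [...] is matched literally.
step1 = re.sub(r'[?|!|\'|"|#]', ' ', sample)
print(step1.split())
```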


In [47]:
###BEGIN SOLUTION
df_tweets['text'] = df_tweets['text'].apply(cleanPunc)
df_tweets.head()
###END SOLUTION

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,name,retweet_count,text,tweet_coord,tweet_location,user_timezone,year,month,day,hour,minute,second
0,570306133677760513,neutral,1.0,,0.638298,virgin america,cairdin,0,@virginamerica what @dhepburn said,,,eastern time (us & canada),2015,2,24,11,35,52
1,570301130888122368,positive,0.3486,,0.0,virgin america,jnardino,0,@virginamerica plus you ve added commercials t...,,,pacific time (us & canada),2015,2,24,11,15,59
2,570301083672813571,neutral,0.6837,,0.638298,virgin america,yvonnalynn,0,@virginamerica i didn t today must mean i n...,,lets play,central time (us & canada),2015,2,24,11,15,48
3,570301031407624196,negative,1.0,bad flight,0.7033,virgin america,jnardino,0,@virginamerica it s really aggressive to blast...,,,pacific time (us & canada),2015,2,24,11,15,36
4,570300817074462722,negative,1.0,can't tell,1.0,virgin america,jnardino,0,@virginamerica and it s a really big bad thing...,,,pacific time (us & canada),2015,2,24,11,14,45


# Task 10: Handle text: remove stop words

Use the `removeStopWords()` function provided to remove all stop words in the `text` column
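
The pattern behind `removeStopWords()` has the shape `\b(word1|word2|...)\W`, so a stop word is only removed when a non-word character follows it. The sketch below rebuilds that shape with a tiny hypothetical stop list (the notebook builds its list from NLTK instead) to show that a stop word at the very end of a string survives.

```python
import re

# Hypothetical mini stop list, for illustration only.
toy_stop_words = ["and", "it", "s", "a", "about"]
toy_pattern = re.compile(r"\b(" + "|".join(toy_stop_words) + r")\W", re.I)

sample = "and it s a really big bad thing about it"
result = toy_pattern.sub(" ", sample)

# The final "it" has no trailing character for \W to match, so it stays.
print(result.split())
```

If trailing stop words matter for your analysis, appending a space to each sentence before calling the function is one simple workaround.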


In [48]:
###BEGIN SOLUTION
df_tweets['text'] = df_tweets['text'].apply(removeStopWords)
df_tweets.head()
###END SOLUTION


Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,name,retweet_count,text,tweet_coord,tweet_location,user_timezone,year,month,day,hour,minute,second
0,570306133677760513,neutral,1.0,,0.638298,virgin america,cairdin,0,@virginamerica @dhepburn said,,,eastern time (us & canada),2015,2,24,11,35,52
1,570301130888122368,positive,0.3486,,0.0,virgin america,jnardino,0,@virginamerica plus added commercials expe...,,,pacific time (us & canada),2015,2,24,11,15,59
2,570301083672813571,neutral,0.6837,,0.638298,virgin america,yvonnalynn,0,@virginamerica today must mean need take...,,lets play,central time (us & canada),2015,2,24,11,15,48
3,570301031407624196,negative,1.0,bad flight,0.7033,virgin america,jnardino,0,@virginamerica really aggressive blast obno...,,,pacific time (us & canada),2015,2,24,11,15,36
4,570300817074462722,negative,1.0,can't tell,1.0,virgin america,jnardino,0,@virginamerica really big bad thing it,,,pacific time (us & canada),2015,2,24,11,14,45


In [49]:
pip install --user -U nltk

Note: you may need to restart the kernel to use updated packages.
