### It is recommended to set up a new virtual environment first. Check out [this guide](https://towardsdatascience.com/how-to-create-a-virtual-environment-and-use-it-on-jupyter-notebook-6c0b7b1cfca0) if you are not sure how to do that for notebooks.

In [None]:
# packages to be installed

# !pip install pandas
# !pip install numpy
# !pip install nltk
# !pip install emoji
# !pip install spacy
# !python -m spacy download en_core_web_sm
# !pip install pandoc
# !pip install -U jupyter_server

## Understanding the data
Start with the following basic checks on the dataframe:
- Get the shape of the data
- Get the columns and their datatypes
- Check how much of the data is null and whether it is okay to delete those rows
- Run basic statistics on the whole dataframe
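
The null check in particular can be made concrete. The following is a minimal sketch, assuming a small made-up frame (`toy_df` and its values are hypothetical, not the Twitter data). It reports the share of missing values per column and how many rows a plain `dropna` would remove.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real tweets
toy_df = pd.DataFrame({
    "content": ["hello", None, "chat gpt", "nice"],
    "like_count": [1.0, 2.0, np.nan, 4.0],
})

# Share of missing values in each column
null_share = toy_df.isna().mean()

# Number of rows that contain at least one missing value
rows_with_nulls = int(toy_df.isna().any(axis=1).sum())

print(null_share)
print(f"{rows_with_nulls} of {len(toy_df)} rows would be dropped")
```

If the share of affected rows is small, deleting them is usually acceptable; if it is large, it may be worth checking which columns cause it first.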

In [None]:
# imports
import pandas as pd
import numpy as np
import emoji
import re
import string
import nltk
from nltk.corpus import stopwords
import spacy
%matplotlib inline

In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
# loading data & making copy to experiment
raw_df = pd.read_csv('data/twitter-jan-mar.csv')
df = raw_df.copy()

In [None]:
# getting the basic understanding of data
df.head(10)

In [None]:
# checking columns, null values & their datatypes
df.info()

In [None]:
# rows and columns
df.shape

In [None]:
# basic statistics - below method only provides stats of numeric columns
df.describe()

## Basic Preprocessing Steps

From the overview of the data above, we can decide the basic cleaning steps.
- We can delete the id column as it will not add any insight to the analysis
- Given the large number of rows overall, we can afford to delete the rows that contain null values
- The date column has datatype object; converting it to datetime serves the analysis better
- Columns such as 'like_count' & 'retweet_count' hold counts, so they should be integers rather than floats
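
The four steps can be tried end to end on a toy frame before touching the real data. This is only a sketch under assumptions: the values below are invented, and only the column names mirror the ones discussed above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the notebook
toy_df = pd.DataFrame({
    "id": [11, 12, 13],
    "date": ["2023-01-05 10:00:00", "2023-02-01 08:30:00", None],
    "like_count": [3.0, 0.0, np.nan],
    "retweet_count": [1.0, 2.0, 5.0],
})

cleaned = (
    toy_df.drop(columns="id")      # identifier carries no analytical signal
          .dropna()                # remove rows with any missing value
          .reset_index(drop=True)
)

# Parse the date strings and cast the counts to integers
cleaned["date"] = pd.to_datetime(cleaned["date"])
cleaned = cleaned.astype({"like_count": "int32", "retweet_count": "int32"})

print(cleaned.dtypes)
```

The integer cast comes after `dropna` on purpose: a float column that still contains NaN cannot be cast to a plain integer type.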

In [None]:
# delete id column (columns= already implies axis=1)
df.drop(columns="id", inplace=True)

In [None]:
# drop all null values from rows
df = df.dropna(axis=0).reset_index(drop=True)

In [None]:
# convert date column to datetime datatype
df["date"] = pd.to_datetime(df["date"])

In [None]:
# convert columns to integer
df["like_count"] = df["like_count"].astype("int32")
df["retweet_count"] = df["retweet_count"].astype("int32")

In [None]:
# validate changes
df.info()

## Analysing textual data

The 'content' column is now the only textual data, and it is the one we need to analyse and preprocess in detail.

In [None]:
df['content'].value_counts()

### Understanding textual data
- The entries all appear to be tweets by users
- Many tweets consist only of the phrase 'chat gpt'
- Tweets contain a lot of unclean content that adds noise to the data, such as
    - punctuation
    - links
    - stopwords
    - HTML tags
    - emoji
    - mixed case (upper & lower)
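
To get a feel for how common each kind of noise is, one can count the tweets that contain it. The sketch below uses three invented tweets and deliberately crude detectors; `sample_tweets` and `checks` are hypothetical names, and the non-ASCII test is only a rough proxy for emoji.

```python
import re

# Hypothetical sample tweets, invented for illustration
sample_tweets = [
    "Chat GPT",
    "Check this out https://example.com <br> #ChatGPT",
    "loving it 😍 @openai!!!",
]

# Crude detectors; each returns True when that kind of noise is present
checks = {
    "link": lambda t: re.search(r"https?://\S+|www\.\S+", t) is not None,
    "html_tag": lambda t: re.search(r"<.*?>", t) is not None,
    "uppercase": lambda t: any(c.isupper() for c in t),
    # Non-ASCII also flags accented letters, so treat this as an upper bound
    "non_ascii": lambda t: any(ord(c) > 127 for c in t),
}

# Number of tweets in which each kind of noise occurs
noise_counts = {
    name: sum(check(t) for t in sample_tweets) for name, check in checks.items()
}
print(noise_counts)
```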

## Preprocessing text data

### Standard preprocessing of text data includes
 - Consistent case - Convert all text to lowercase
 - URLs & tags - Remove all web URLs & HTML tags
 - Stopwords - Remove stopwords so that the meaningful words stand out
 - Punctuation - Remove punctuation symbols from the text
 - Emoji - Emoji can help in recognising the tone of a text, so convert them into words
 - Tokenize - Split the text into consistent tokens
 - Lemmatize - Reduce words to their root forms
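
The order of these steps matters: links, for instance, have to go before punctuation, otherwise `://` is stripped and the URL pattern no longer matches. The following standard-library sketch illustrates this on one invented sentence. It is an assumption-laden stand-in, not the notebook's implementation: `STOP_WORDS` is a tiny hypothetical list, and the emoji and lemmatization steps are left out because they need third-party packages.

```python
import re
import string

# Tiny stand-in stopword list; the notebook uses NLTK's English list instead
STOP_WORDS = {"the", "is", "a", "at", "this"}

def clean(text):
    """Apply lowercase, link, tag, punctuation and stopword steps in order."""
    text = text.lower()
    # Links go first, while '://' is still intact
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    # HTML tags
    text = re.sub(r"<.*?>", "", text)
    # Drop punctuation but keep '@' and '#' for mentions and hashtags
    drop = set(string.punctuation) - {"@", "#"}
    text = "".join(ch for ch in text if ch not in drop)
    # Remove stopwords and normalise whitespace
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)

example = "This is <b>GREAT</b>, look at https://example.com/x?y=1 #ChatGPT @OpenAI!"
cleaned_example = clean(example)
print(cleaned_example)
```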

In [None]:
# Functions to preprocess the text

def convert_lower(text):
    """This function converts the input text into lowercase"""
    
    modified_text = str(text).lower()
#     print("Converting to lower")
    return modified_text

def remove_url(text):
    """This function finds the pattern of url in input text and removes it."""
    
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    modified_text = url_pattern.sub(r'', text)
#     print("Removing URLs..")
    return modified_text

def remove_tags(text):
    """This function finds the pattern of HTML tags in input text and removes it."""
    
    tags_pattern = re.compile('<.*?>')
    modified_text = tags_pattern.sub(r'', text)
#     print("Removing HTML tags..")
    return modified_text

def remove_punctuation_without_hashtags_and_mentions(text):
    """This function removes punctuation and digits from input text, keeping '#' and '@'"""
    punctuation = string.punctuation + string.digits
    # keep '@' & '#' out of the removal list as hashtags & mentions need them
    filtered_punc = punctuation.replace("@","").replace("#","")
    # non-string input (e.g. NaN) is returned unchanged instead of raising an error
    if not isinstance(text, str):
        return text
    modified_text = ''.join(char for char in text if char not in filtered_punc)
#     print("Removing punctuations..")
    return modified_text

# download the stopwords corpus once, instead of on every call
nltk.download('stopwords', quiet=True)
# a set removes duplicates and gives fast membership checks
STOP_WORDS = set(stopwords.words('english'))

def remove_stopwords(text):
    """This function removes the defined stopwords from input text"""
    words = text.split()
    modified_words = [word for word in words if word.lower() not in STOP_WORDS]
    modified_text = " ".join(modified_words)
#     print("Removing stopwords..")
    return modified_text

def demojize(text):
    """This function replaces each emoji with its text name (e.g. ':red_heart:') for contextual analysis"""
    modified_text = emoji.demojize(text)
    return modified_text

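def tokenize(text):
    """This function tokenizes input text and returns a spaCy Doc"""
    modified_text = nlp(text)
#     print("Tokenizing text..")
    return modified_text

def lemmatize_words(text):
    """This function lemmatizes the input text (a string or a spaCy Doc)"""
    # reuse an existing Doc (e.g. the output of tokenize) instead of parsing it again
    doc = text if isinstance(text, spacy.tokens.Doc) else nlp(text)
    # '-PRON-' is the pronoun placeholder lemma of spaCy v2; keep the original word in that case
    modified_text = " ".join([token.lemma_ if token.lemma_ != '-PRON-' else token.text for token in doc])
#     print("Lemmatizing text..")
    return modified_text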

def extract_mentions(df):
    """This function extracts all mentioned accounts and stores them in a separate column"""
    # '@name' counts as a mention only when it is not preceded by a letter, digit, '_', '.' or '-'
    mention_pattern = r"(?<![a-zA-Z0-9_.-])@([A-Za-z]+[A-Za-z0-9_]+)"
    df["accounts_mentioned"] = df["content"].apply(lambda x: re.findall(mention_pattern, x))
    return df

def extract_hashtags(df):
    """This function extracts hashtags from the content column and stores them in a separate column"""
    df["hashtags"] = df["content"].apply(lambda x: re.findall(r"#(\w+)", x))
    return df
    

### Extract accounts mentioned & hashtags in tweets

In [None]:
df = extract_mentions(df)
df = extract_hashtags(df)
df.head(5)
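
To see what the two extraction steps return, it helps to run the patterns on a single invented tweet. The patterns below are raw-string versions written for this sketch with the same intent as the notebook's; `MENTION_RE`, `HASHTAG_RE` and `sample` are hypothetical names. Note that the email address is not picked up as a mention.

```python
import re

# Raw-string patterns with the same intent as the ones used in the notebook
MENTION_RE = re.compile(r"(?<![A-Za-z0-9_.-])@([A-Za-z]+[A-Za-z0-9_]+)")
HASHTAG_RE = re.compile(r"#(\w+)")

# Invented tweet containing a mention, an email address and two hashtags
sample = "@OpenAI thanks! mail me at joe@example.com #ChatGPT #AI_2023"

mentions = MENTION_RE.findall(sample)   # names without the leading '@'
hashtags = HASHTAG_RE.findall(sample)   # tags without the leading '#'
print(mentions, hashtags)
```

Extraction runs before the cleaning steps so that the original casing of account names and hashtags is preserved in the new columns.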

### PreProcess Tweets

In [None]:
df["content"] = df["content"].apply(convert_lower)
df["content"] = df["content"].apply(remove_url)
df["content"] = df["content"].apply(remove_tags)
df["content"] = df["content"].apply(remove_punctuation_without_hashtags_and_mentions)
df["content"] = df["content"].apply(remove_stopwords)
df["content"] = df["content"].apply(demojize)
df["content"] = df["content"].apply(tokenize)
df["content"] = df["content"].apply(lemmatize_words)
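
The same chain can also be expressed as one composed function, so the column is traversed once rather than once per step. This is an optional sketch; `compose` is a hypothetical helper, and the three steps passed to it are simple stand-ins for the notebook's functions.

```python
from functools import reduce

def compose(*steps):
    """Return a function that applies the given steps left to right."""
    return lambda text: reduce(lambda acc, step: step(acc), steps, text)

# Stand-in steps; in the notebook these would be convert_lower, remove_url, ...
pipeline = compose(str.lower, str.strip, lambda t: t.replace("!", ""))

result = pipeline("  Hello World!  ")
print(result)
```

With the notebook's functions passed to `compose`, the whole cell would reduce to a single `df["content"].apply(pipeline)` call.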

In [None]:
# strip the colons that demojize leaves around emoji names
df["content"] = df["content"].str.replace(":", "", regex=False)

In [None]:
df.head(5)

In [None]:
df.info()

### Export the cleaned file

In [None]:
# index=False avoids writing the row index as an extra unnamed column
df.to_csv("data/cleaned-twitter-data.csv", index=False)
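
One thing to keep in mind when reading the file back: CSV stores list-valued columns such as `hashtags` as plain text. The sketch below shows the round trip on a hypothetical one-row frame, using an in-memory buffer in place of a file, and parses the text back into lists with `ast.literal_eval`.

```python
import ast
import io

import pandas as pd

# Hypothetical frame with a list-valued column like "hashtags"
toy_df = pd.DataFrame({"content": ["hi #ai"], "hashtags": [["ai"]]})

buffer = io.StringIO()            # in-memory stand-in for a file on disk
toy_df.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
as_string = reloaded.loc[0, "hashtags"]   # text representation, not a list

# Parse the stored text back into Python lists
reloaded["hashtags"] = reloaded["hashtags"].apply(ast.literal_eval)
as_list = reloaded.loc[0, "hashtags"]

print(type(as_string).__name__, type(as_list).__name__)
```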