## Classification
The aim of this section is to apply supervised learning methods to build a classification model that determines which tweets describe an actual emergency. Depending on its success, the same approach could be used to predict a variety of other themes, given appropriate labelling. The class labels have already been assigned in the `target` column, which shows 1 if the tweet has been classified as an emergency and 0 if not. The original [file can be found here](/tweets-raw.csv).

It is important to preface that this dataset faces the classic class imbalance problem: emergency tweets (1) constitute only 18.5% of records before cleaning. Techniques that account for class imbalance are therefore needed. For instance, the F1-score is a more informative metric than accuracy here, because a model that always predicts 0 would still reach about 81.5% accuracy. We chose to oversample rather than undersample, as undersampling would mean discarding roughly 7k more non-emergency (0) records, leaving an even smaller training set after cleaning.
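
To illustrate why accuracy can mislead on imbalanced labels, and what random oversampling does, here is a minimal sketch in plain Python. The toy labels (185 positives in 1,000 records) only mimic the 18.5% ratio quoted above, and the helper names `accuracy`, `f1_score` and `random_oversample` are hypothetical, not functions from this notebook.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def random_oversample(rows, labels, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes are the same size."""
    rng = random.Random(seed)
    minority_rows = [r for r, l in zip(rows, labels) if l == minority]
    majority_count = len(rows) - len(minority_rows)
    extra = [rng.choice(minority_rows)
             for _ in range(majority_count - len(minority_rows))]
    return rows + extra, labels + [minority] * len(extra)

# Toy labels with the imbalance described above (assumed, not the real data)
y_true = [1] * 185 + [0] * 815
always_zero = [0] * 1000  # a useless model that never flags an emergency

print(accuracy(y_true, always_zero))  # 0.815
print(f1_score(y_true, always_zero))  # 0.0

rows, labels = random_oversample(list(range(1000)), y_true)
print(labels.count(0), labels.count(1))  # 815 815
```

Oversampling should be applied to the training split only, so that duplicated records do not leak into the validation or test sets.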

Here is the order to expect in the rest of this report:
1. [Data pre-processing](Classification.ipynb#1-data-preprocessing). The tweets are initially taken as is. Besides the actual text, they contain emojis, vulgarities, and hashtags in varying characters. Locations range from actual values such as United States of America to "hell" or "jesus". The steps below process this data.
2. Feature engineering. The keyword column, which contains the "emergency" word in the sentence, will be added to the feature list. The sentences will be broken down into words as features using a TF-IDF approach.
3. Dataset splitting. We will need a training set, a validation set, and a test set. We have decided to employ the holdout method, which uses 2/3 of the data for model training.
4. Model selection. We will attempt to use decision tree induction, linear regression, and Bayesian classification. We may also consider ensemble methods: random forest and boosting via AdaBoost.
5. Training phase. Models will be trained on the keyword and the tokenised sentences as features.
6. Evaluation phase. Here, we will apply metrics such as the confusion matrix, the Receiver Operating Characteristic (ROC) curve, and the F1-score mentioned previously.


### 1. Data Preprocessing

In [23]:
! pip install Keras
! pip install tensorflow

[notice] A new release of pip is available: 24.2 -> 24.3.1
[notice] To update, run: python3 -m pip install --upgrade pip
Collecting tensorflow
  Downloading tensorflow-2.18.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)
Collecting astunparse>=1.6.0 (from tensorflow)
  Downloading astunparse-1.6.3-py2.py3-none-any.whl.metadata (4.4 kB)
Collecting flatbuffers>=24.3.25 (from tensorflow)
  Downloading flatbuffers-24.3.25-py2.py3-none-any.whl.metadata (850 bytes)
Collecting gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 (from tensorflow)
  Downloading gast-0.6.0-py3-none-any.whl.metadata (1.3 kB)
Collecting google-pasta>=0.1.1 (from tensorflow)
  Downloading google_pasta-0.2.0-py3-none-any.whl.metadata (814 bytes)
Collecting libclang>=13.0.0 (from tensorflow)
  Downloading libclang-18.1.1-py2.py3-none-manylinux2010_x86_64.whl.metad

In [53]:
import pandas as pd
import re
from tensorflow.keras.preprocessing.text import text_to_word_sequence

# Read the CSV file
df = pd.read_csv('tweetsv2.csv')

# Emoji pattern adapted from: https://stackoverflow.com/questions/33404752/removing-emojis-from-a-string-in-python
# Compiled once here rather than on every call.
EMOJI_PATTERN = re.compile("["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "\U00002500-\U00002BEF"  # box drawing through miscellaneous symbols and arrows
    "\U00002702-\U000027B0"  # dingbats
    "\U000024C2-\U0001F251"  # enclosed characters; this wide range also strips CJK text
    "\U0001f926-\U0001f937"
    "\U00010000-\U0010ffff"  # all supplementary planes
    "\u2640-\u2642"
    "\u2600-\u2B55"
    "\u200d"  # zero-width joiner
    "\u23cf"
    "\u23e9"
    "\u231a"
    "\ufe0f"  # variation selector-16
    "\u3030"
    "]+", re.UNICODE)

def remove_emojis(data):
    return EMOJI_PATTERN.sub('', data)

def remove_punctuation(text):
    # Drop every character that is neither a word character nor whitespace
    return re.sub(r'[^\w\s]', '', text)

# Strip punctuation and emojis, then lowercase and tokenise each tweet
df['new'] = df['text.1'].apply(lambda x: text_to_word_sequence(remove_emojis(remove_punctuation(str(x)))))

print(df['new'][17])

# Display the first few rows of the dataframe
display(df.head())

['rengoku', 'sets', 'my', 'heart', 'ablaze', 'ps', 'i', 'missed', 'this', 'style', 'of', 'coloring', 'i', 'do', 'so', 'here', 'it', 'is', 'c']


| Unnamed: 0.1 | Unnamed: 0 | keyword | location | text | text.1 | url (without https://) | target | new |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | ablaze | | Communal violence in Bhainsa, Telangana. "Ston... | Communal violence in Bhainsa, Telangana. "Ston... | | 1 | [communal, violence, in, bhainsa, telangana, s... |
| 1 | 1 | ablaze | | Telangana: Section 144 has been imposed in Bha... | Telangana: Section 144 has been imposed in Bha... | | 1 | [telangana, section, 144, has, been, imposed, ... |
| 2 | 2 | ablaze | New York City | Arsonist sets cars ablaze at dealership https:... | Arsonist sets cars ablaze at dealership | t.co/gOQvyJbpVI | 1 | [arsonist, sets, cars, ablaze, at, dealership] |
| 3 | 3 | ablaze | Morgantown, WV | Arsonist sets cars ablaze at dealership https:... | #SPILL! | | 1 | [spill] |
| 4 | 4 | ablaze | | "Lord Jesus, your love brings freedom and pard... | "Lord Jesus, your love brings freedom and pard... | t.co/VlTznnPNi8 | 0 | [lord, jesus, your, love, brings, freedom, and... |


### 2. Feature Engineering
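
As a rough illustration of the TF-IDF weighting named in the outline, the sketch below computes it by hand on three of the token lists shown in the preprocessing output. It assumes raw term frequency multiplied by `ln(N / df)`; the function name `tfidf` is hypothetical, and a library implementation such as scikit-learn's `TfidfVectorizer` uses a smoothed, normalised variant, so its numbers would differ.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenised document.

    weight = (count of term in doc / doc length) * ln(N / document frequency)
    """
    n_docs = len(docs)
    # Number of documents each term appears in (counted once per document)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Token lists taken from the cleaned 'new' column above
docs = [
    ["arsonist", "sets", "cars", "ablaze", "at", "dealership"],
    ["rengoku", "sets", "my", "heart", "ablaze"],
    ["spill"],
]
vectors = tfidf(docs)

# 'ablaze' occurs in two of three documents, so it is weighted lower
# than 'arsonist', which occurs in only one.
print(vectors[0]["ablaze"] < vectors[0]["arsonist"])  # True
```

Because every tweet in a keyword group contains that keyword, words like "ablaze" receive a low weight within the group, which is one reason to keep the keyword column as a separate feature.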

### 3. Dataset splitting
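
A minimal sketch of a stratified holdout split, keeping 2/3 of each class for training as stated in the outline. The outline does not say how the remaining third is divided, so splitting it evenly between validation and test is an assumption, as are the function name `stratified_holdout` and the toy labels. In practice, scikit-learn's `train_test_split` with its `stratify` argument would do the same job.

```python
import random

def stratified_holdout(labels, train_frac=2/3, seed=42):
    """Return (train, val, test) index lists that preserve the class ratio."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * train_frac)
        n_val = (len(indices) - n_train) // 2  # assumption: remainder split evenly
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test

# Toy labels mimicking the 18.5% emergency rate (not the real data)
labels = [1] * 185 + [0] * 815
train, val, test = stratified_holdout(labels)
print(len(train), len(val), len(test))  # 666 167 167
```

Stratifying matters here because a purely random split of an imbalanced dataset can leave the validation or test set with too few emergency tweets to evaluate on.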

### 4. Model selection
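
Of the candidates listed in the outline, Bayesian classification is the easiest to sketch from scratch. The example below assumes a multinomial Naive Bayes with Laplace smoothing over raw word counts; the class name `NaiveBayes` and the four toy training documents are invented for illustration and are not part of the dataset. A library version such as scikit-learn's `MultinomialNB` would normally be used instead.

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes over tokenised documents (illustrative sketch)."""

    def fit(self, docs, labels, alpha=1.0):
        self.classes = sorted(set(labels))
        self.vocab = {term for doc in docs for term in doc}
        self.log_prior = {}
        self.log_lik = {}
        for c in self.classes:
            class_docs = [d for d, l in zip(docs, labels) if l == c]
            self.log_prior[c] = math.log(len(class_docs) / len(docs))
            counts = Counter(term for d in class_docs for term in d)
            # Laplace smoothing: add alpha to every vocabulary term's count
            total = sum(counts.values()) + alpha * len(self.vocab)
            self.log_lik[c] = {
                term: math.log((counts[term] + alpha) / total)
                for term in self.vocab
            }
        return self

    def predict(self, doc):
        # Words never seen in training are ignored
        scores = {
            c: self.log_prior[c]
               + sum(self.log_lik[c][t] for t in doc if t in self.vocab)
            for c in self.classes
        }
        return max(scores, key=scores.get)

# Invented toy data: 1 = emergency, 0 = not an emergency
train_docs = [
    ["building", "ablaze", "evacuate"],
    ["flood", "evacuate", "now"],
    ["heart", "ablaze", "love"],
    ["love", "this", "song"],
]
train_labels = [1, 1, 0, 0]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["flood", "evacuate"]))  # 1
print(model.predict(["love", "song"]))       # 0
```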

### 5. Training phase

### 6. Evaluation phase
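
A minimal sketch of two of the metrics named in the outline: the confusion matrix and the area under the ROC curve. The labels, scores and 0.5 threshold are invented for illustration, and the helper names `confusion_counts` and `roc_auc` are hypothetical. The AUC is computed as the probability that a randomly chosen emergency tweet is scored above a randomly chosen non-emergency one, which is quadratic in the number of records and only suitable for small examples; scikit-learn's `confusion_matrix` and `roc_auc_score` would be the usual choice.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp), treating 1 (emergency) as the positive class."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1:
            fn += 1
        elif p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented labels and model scores
y_true = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7]
y_pred = [int(s >= 0.5) for s in scores]  # assumed decision threshold of 0.5

print(confusion_counts(y_true, y_pred))  # (2, 1, 1, 2)
print(round(roc_auc(y_true, scores), 3))  # 0.778
```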