# Exploration of the Twitter dataset

In [2]:
import pandas as pd
from IPython.display import display
import pandas_profiling



## twitter_dataset_full.csv

In [3]:
df_twitter_full = pd.read_csv("data/raw/twitter_dataset_full.csv", delimiter=",")
display(df_twitter_full.head())

Unnamed: 0,is_positive,id,datetime,user,message
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,joy_wolf,@Kwesidei not the whole crew


In [4]:
# Commented out because the full dataset is too large to profile.
# profile = pandas_profiling.ProfileReport(df_twitter_full)
# profile.to_file("data/reports/twitter_dataset_full_profile_report.html")

### Main take-aways
- __is_positive__ is an almost perfectly balanced binary class (0: 799999 vs 1: 800000).
- __id__ is not unique, which led me to suspect duplicate rows. Further inspection with drop_duplicates() showed that this is not the case: some tweets (10) are classified as both positive and negative because of their ambiguity.
- The dataset has no __missing values__.
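
The three checks behind these take-aways can also be run directly, without the profiling report. The following is a minimal sketch on a hypothetical five-row frame: the `toy` data is invented for illustration and only the column names come from the dataset. With the real data, `df_twitter_full` would take the place of `toy`.

```python
import pandas as pd

# Hypothetical toy frame; only the column names mirror the real dataset.
toy = pd.DataFrame(
    {
        "is_positive": [0, 1, 0, 1, 1],
        "id": [10, 10, 11, 12, 13],  # id 10 appears with both labels
        "message": ["so so", "so so", "bad day", "great", "nice"],
    }
)

# Class balance of the target column.
class_counts = toy["is_positive"].value_counts().sort_index()

# Fully identical rows (the rows drop_duplicates() would remove).
n_exact_duplicates = int(toy.duplicated().sum())

# Ids that carry more than one distinct label.
labels_per_id = toy.groupby("id")["is_positive"].nunique()
ambiguous_ids = labels_per_id[labels_per_id > 1].index.tolist()

# Missing values per column.
n_missing = toy.isna().sum()

print(class_counts.to_dict(), n_exact_duplicates, ambiguous_ids, int(n_missing.sum()))
```

A repeated id with zero exact duplicates is the pattern described above: the rows differ only in their label.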

## dataset_small_w_bart_preds.csv

In [6]:
df_twitter_small = pd.read_csv(
    "data/raw/dataset_small_w_bart_preds.csv",
    delimiter=",",
    float_precision="round_trip",
)
display(df_twitter_small.head())

Unnamed: 0,is_positive,id,datetime,user,message_clean,bart_is_positive
0,0,2323266775,Thu Jun 25 00:15:43 PDT 2009,gulti,had dream sneaked out escape into the,0.075236
1,1,2192626220,Tue Jun 16 07:18:56 PDT 2009,lpgrant,richmondgl murder train just cracked but you r...,0.003549
2,0,1824060456,Sat May 16 23:54:19 PDT 2009,starlah,sherrymain thanks for hosting your own birthda...,0.858189
3,0,2248516272,Fri Jun 19 20:44:57 PDT 2009,babymakes7,angelic rebel umm basically simple math proble...,0.226053
4,1,2050379110,Fri Jun 05 18:33:51 PDT 2009,Gelfand,middleclassgirl and there nothing wrong with that,0.498563


In [7]:
profile = pandas_profiling.ProfileReport(df_twitter_small)
profile.to_file("data/reports/dataset_small_w_bart_preds_profile_report.html")



### Main take-aways
- __is_positive__ is an almost perfectly balanced binary class (0: 9911 vs 1: 10089).
- __id__ is not unique, because 1 tweet is classified as both positive and negative due to its ambiguity.
- The dataset has 1 __missing value__ in the field __message_clean__.

__NOTE__: By inspecting the original dataset, I concluded that the missing value in __message_clean__ does not correspond to missing data in the original message.
Because the message is so short, the BART encoding probably masked too many tokens and thus decoded an empty message.
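
If the cleaned message was indeed an empty string, that would be consistent with it showing up as missing after loading: by default `pd.read_csv` turns empty fields into `NaN`. The sketch below illustrates this with a hypothetical two-row frame written to an in-memory CSV. It does not confirm what happened in the actual preprocessing, and the data is invented.

```python
import io

import pandas as pd

# Hypothetical cleaned messages; the second one is an empty string, not NaN.
cleaned = pd.DataFrame({"id": [1, 2], "message_clean": ["good morning", ""]})

# Write to an in-memory CSV so the example needs no files on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Default settings: the empty field is read back as NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps empty fields as empty strings.
buffer.seek(0)
literal_read = pd.read_csv(buffer, keep_default_na=False)

n_missing_default = int(default_read["message_clean"].isna().sum())
n_missing_literal = int(literal_read["message_clean"].isna().sum())
print(n_missing_default, n_missing_literal)
```

Either reading is usable downstream; the point is only that a missing value in this column need not mean the original tweet was empty.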

## Joint dataset
Let us create a dataset that contains both the original __message__ and the __message_clean__ from BART.

In [8]:
df_joint = df_twitter_small.merge(
    df_twitter_full, on=["is_positive", "id", "datetime", "user"], how="left"
)  # join on all shared columns to avoid _x/_y suffixes

print(df_twitter_small.shape)
print(df_joint.shape)
display(df_joint.head(2))

(20000, 6)
(20000, 7)


Unnamed: 0,is_positive,id,datetime,user,message_clean,bart_is_positive,message
0,0,2323266775,Thu Jun 25 00:15:43 PDT 2009,gulti,had dream sneaked out escape into the,0.075236,"I had a dream, it sneaked out to escape into t..."
1,1,2192626220,Tue Jun 16 07:18:56 PDT 2009,lpgrant,richmondgl murder train just cracked but you r...,0.003549,@RichmondGL Ha! Murder Train just cracked me u...
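
Because __id__ is not unique, the choice of join keys matters for the row count. The sketch below uses two hypothetical frames (all values invented, column names borrowed from the dataset) to show that joining on `id` alone duplicates the ambiguous tweet, while joining on the label as well keeps one row per input row. The `validate` and `indicator` arguments of `DataFrame.merge` are used here as optional safety checks; the merge above does not use them.

```python
import pandas as pd

# Hypothetical frames; id 10 is an ambiguous tweet stored once per label.
full = pd.DataFrame(
    {
        "is_positive": [0, 1, 0],
        "id": [10, 10, 11],
        "user": ["ann", "ann", "bob"],
        "message": ["So so...", "So so...", "Bad day!"],
    }
)
small = pd.DataFrame(
    {
        "is_positive": [1, 0],
        "id": [10, 11],
        "user": ["ann", "bob"],
        "message_clean": ["so so", "bad day"],
    }
)

# Joining on id alone matches id 10 against both labelled copies.
by_id = small.merge(full[["id", "message"]], on="id", how="left")

# Joining on the label as well keeps one row per row of `small`.
# validate="many_to_one" raises MergeError if the right-hand keys repeat;
# indicator=True adds a _merge column that flags unmatched rows.
by_key = small.merge(
    full,
    on=["is_positive", "id", "user"],
    how="left",
    validate="many_to_one",
    indicator=True,
)

print(len(small), len(by_id), len(by_key))
```

Comparing the shapes before and after the merge, as done in the cell above, is the simplest form of the same check.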


In [9]:
df_joint["datetime"] = pd.to_datetime(df_joint["datetime"])
df_joint.info()



<class 'pandas.core.frame.DataFrame'>
Int64Index: 20000 entries, 0 to 19999
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   is_positive       20000 non-null  int64         
 1   id                20000 non-null  int64         
 2   datetime          20000 non-null  datetime64[ns]
 3   user              20000 non-null  object        
 4   message_clean     19999 non-null  object        
 5   bart_is_positive  20000 non-null  float64       
 6   message           20000 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(2), object(3)
memory usage: 1.2+ MB
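
Calling `pd.to_datetime` without a format leaves the layout to be inferred, and how the `PDT` abbreviation is handled may depend on the pandas version. A more explicit alternative is sketched below, assuming every row carries the literal `PDT` abbreviation as in the samples shown earlier. It removes the abbreviation and parses with a fixed format, which gives timezone-naive timestamps like the ones in the `info()` output.

```python
import pandas as pd

# Two timestamps copied from the head of the full dataset.
raw = pd.Series(["Mon Apr 06 22:19:49 PDT 2009", "Mon Apr 06 22:20:00 PDT 2009"])

# Assumption: every row uses the literal "PDT" abbreviation, as in the samples.
# Remove it, then parse with an explicit format; the result is timezone-naive.
without_tz = raw.str.replace(" PDT", "", regex=False)
parsed = pd.to_datetime(without_tz, format="%a %b %d %H:%M:%S %Y")

print(parsed.dtype)
```

An explicit format also fails loudly on rows that do not match, which makes unexpected timestamp layouts easier to spot.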


In [10]:
profile = pandas_profiling.ProfileReport(df_joint)
profile.to_file(
    "data/reports/dataset_small_w_bart_preds_and_original_message_profile_report.html"
)



In [None]:
df_joint.to_csv("data/processed/dataset_small_w_bart_preds_and_original_message.csv", index=False)

### BART is negative
Even though the data is evenly balanced between positive and negative examples, BART predicts most messages not to be positive:

![Alt text](img/bart_is_positive_histo.png)
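
The skew can also be expressed in numbers by thresholding __bart_is_positive__ and comparing the result with __is_positive__. The sketch below does this for the five rows shown in the head of the small dataset, so its figures say nothing about the full 20000 rows. The 0.5 cut-off is an assumption, not something taken from the source.

```python
import pandas as pd

# The five rows shown in the head of dataset_small_w_bart_preds.csv.
sample = pd.DataFrame(
    {
        "is_positive": [0, 1, 0, 0, 1],
        "bart_is_positive": [0.075236, 0.003549, 0.858189, 0.226053, 0.498563],
    }
)

THRESHOLD = 0.5  # assumed cut-off, not taken from the source

# Turn the BART score into a hard 0/1 prediction.
predicted_positive = (sample["bart_is_positive"] >= THRESHOLD).astype(int)

share_labelled_positive = sample["is_positive"].mean()
share_predicted_positive = predicted_positive.mean()
accuracy = (predicted_positive == sample["is_positive"]).mean()

# Rows: true label, columns: thresholded BART prediction.
confusion = pd.crosstab(sample["is_positive"], predicted_positive)

print(share_labelled_positive, share_predicted_positive, accuracy)
print(confusion)
```

Applied to `df_joint`, the same lines would show how far the share of predicted positives falls below the roughly 50% share of positive labels.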
