# Cleaning
In this notebook we clean the training data more thoroughly and store the cleaned data for use in the LastVersion notebook.

## 1. Import

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import collections  as mc
%load_ext autoreload
%autoreload 2
import pandas as pd 
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.preprocessing import StandardScaler
sns.set_style("white")

import string
import re

## 2. Loading Data

Below is what the dataset looks like.

In [None]:

df_train = pd.read_csv("https://raw.githubusercontent.com/sarrab/DMML2020_COOP/main/data/training_data.csv")
df_test = pd.read_csv("https://raw.githubusercontent.com/sarrab/DMML2020_COOP/main/data/test_data.csv")
df_train.head()  # kept last so the notebook displays the preview
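
The cells that follow operate on `df_train_copy_9` and `df_test_copy`, which are not created anywhere in this section. A minimal sketch, assuming they are plain copies of the loaded frames; `toy_train` and `toy_copy` are hypothetical stand-ins so the example runs offline:

```python
import pandas as pd

# Hypothetical stand-in for df_train; the real frame comes from read_csv.
toy_train = pd.DataFrame({
    "id": [1, 2, 3],
    "keyword": ["fire", None, "flood"],
    "location": [None, "USA", None],
    "text": ["a", "b", "c"],
    "target": [1, 0, 1],
})

# Assumption: the working frame is an independent copy, so cleaning
# never modifies the originally loaded data.
toy_copy = toy_train.copy()
toy_copy.loc[0, "text"] = "changed"

# Count missing values per column before deciding how to handle them.
missing_counts = toy_copy.isna().sum()
print(missing_counts)
```

Working on a copy keeps the raw frame available for comparison after each cleaning step.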


# Improvement: A New Approach With Better Assumptions

## Cleaning: Rethought

In [None]:
#this provides a social tokenizer, word segmentation and spell correction
!pip install ekphrasis


#### Filling in the Missing Data with Tags

In our previous cleaning, we dropped rows with missing values. With such a small dataset, we realised that this was not a good idea. Instead, we now fill the missing values with specific tags that we have defined.

In [None]:
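# Assign the result back instead of calling fillna(inplace=True) on a selected
# column, which may act on a temporary copy in recent pandas versions
df_train_copy_9["location"] = df_train_copy_9["location"].fillna("<no-location>")
df_test_copy["location"] = df_test_copy["location"].fillna("<no-location>")

df_train_copy_9["keyword"] = df_train_copy_9["keyword"].fillna("<no-keyword>")
df_test_copy["keyword"] = df_test_copy["keyword"].fillna("<no-keyword>")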
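
To illustrate why tagging is preferable to dropping on a small dataset, here is a minimal sketch on a made-up four-row frame (the values are hypothetical, not taken from the training data):

```python
import pandas as pd

# Toy frame (hypothetical values) with gaps in keyword and location.
toy = pd.DataFrame({
    "keyword": ["fire", None, "flood", None],
    "location": [None, "USA", None, None],
    "text": ["t1", "t2", "t3", "t4"],
})

# Dropping rows with any missing value leaves no rows in this toy frame.
rows_after_drop = len(toy.dropna())

# Filling with placeholder tags keeps every row.
filled = toy.fillna({"keyword": "<no-keyword>", "location": "<no-location>"})
rows_after_fill = len(filled)

print(rows_after_drop, rows_after_fill)
```

Passing a dict to `DataFrame.fillna` fills each column with its own tag in a single call.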

#### Removing URLs

In [None]:
# this takes care of the url
import re

df_train_copy_9.text = df_train_copy_9.text.apply(lambda x: re.sub(r'https?://\S+|www\.\S+', '', x, flags = re.MULTILINE))
df_train_copy_9.text.values

array(['Black Eye 9: A space battle occurred at Star O784 involving 3 fleets totaling 3945 ships with 17 destroyed',
       '#world FedEx no longer to transport bioterror germs in wake of anthrax lab mishaps  ',
       'Reality Training: Train falls off elevated tracks during windstorm  #Paramedic #EMS',
       ...,
       "Hollywood Movie About Trapped Miners Released in Chile: 'The 33' Hollywood movie about trapped miners starring... ",
       'Friendly reminder that the only country to ever use nuclear weapons is the U.S. And it was against civilians. ',
       'Buildings are on fire and they have time for a business meeting #TheStrain'],
      dtype=object)
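
The same pattern can be tried on a single string. This is a small sketch with a made-up tweet; `strip_urls` is a hypothetical helper name, not part of the notebook:

```python
import re

# Same pattern as in the cell above: http(s):// links and bare www. links.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def strip_urls(text):
    """Remove links; the whitespace around them is left untouched."""
    return URL_PATTERN.sub("", text)

sample = "Train derails http://t.co/abc123 #EMS"  # made-up tweet
cleaned = strip_urls(sample)
print(repr(cleaned))
```

Because only the link itself is removed, the spaces around it remain, which is why some cleaned tweets show double or trailing spaces.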

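**This takes care of special characters.** Only letters, whitespace and the characters `: . ' ! ? , # @ _` are kept; everything else, including digits, is removed.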

In [None]:

df_train_copy_9.text = df_train_copy_9.text.apply(lambda x: re.sub(r"[^a-zA-Z:.'!?,#@_\s]+", '', x, flags = re.MULTILINE))
df_train_copy_9.keyword = df_train_copy_9.keyword.apply(lambda x: re.sub(r"[^a-zA-Z:.'!?,#@_\s]+", '', str(x), flags = re.MULTILINE))
df_train_copy_9.location = df_train_copy_9.location.apply(lambda x: re.sub(r"[^a-zA-Z:.'!?,#@_\s]+", '', str(x), flags = re.MULTILINE))
df_train_copy_9.text.values

array(['Black Eye : A space battle occurred at Star O involving  fleets totaling  ships with  destroyed',
       '#world FedEx no longer to transport bioterror germs in wake of anthrax lab mishaps  ',
       'Reality Training: Train falls off elevated tracks during windstorm  #Paramedic #EMS',
       ...,
       "Hollywood Movie About Trapped Miners Released in Chile: 'The ' Hollywood movie about trapped miners starring... ",
       'Friendly reminder that the only country to ever use nuclear weapons is the U.S. And it was against civilians. ',
       'Buildings are on fire and they have time for a business meeting #TheStrain'],
      dtype=object)
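
A short sketch of what this character filter does to individual strings; `strip_special` is a hypothetical helper name. It also shows why the `<no-location>` tag later appears as `nolocation`:

```python
import re

# Same character class as in the cell above: anything not listed is removed.
KEEP_PATTERN = re.compile(r"[^a-zA-Z:.'!?,#@_\s]+")

def strip_special(text):
    """Drop digits and every symbol outside the allowed set."""
    return KEEP_PATTERN.sub("", str(text))

tag_after = strip_special("<no-location>")
mixed_after = strip_special("MH370 16yr w/ <3")
print(repr(tag_after), repr(mixed_after))
```

One consequence worth keeping in mind: with the cells in this order, any later replacement pattern that contains digits, `/`, `-` or `<` can no longer match, because those characters are already gone.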

In [None]:
df_test_copy.text = df_test_copy.text.apply(lambda x: re.sub(r'https?://\S+|www\.\S+', '', x, flags = re.MULTILINE))
df_test_copy.text.values 

array(['Crptotech tsunami and banks.\r\n  #Banking #tech #bitcoing #blockchain',
       "I'm that traumatised that I can't even spell properly! Excuse the typos!",
       '@foxnewsvideo @AIIAmericanGirI @ANHQDC So ... where are the rioters looters and burning buildings????  WHITE LIVES MATTER!!!!!!',
       ...,
       'Eruption of Indonesian volcano sparks transport chaos: In this picture done from video Mount Raung in\x89Û_  ?',
       'Never let fear get in the way of achieving your dreams! #deltachildren #instaquote #quoteoftheday #Disney #WaltDisney ',
       'wowo--=== 12000 Nigerian refugees repatriated from Cameroon'],
      dtype=object)

In [None]:
df_test_copy.text = df_test_copy.text.apply(lambda x: re.sub(r"[^a-zA-Z:.'!?,#@_\s]+", '', x, flags = re.MULTILINE))
df_test_copy.keyword = df_test_copy.keyword.apply(lambda x: re.sub(r"[^a-zA-Z:.'!?,#@_\s]+", '', str(x), flags = re.MULTILINE))
df_test_copy.location = df_test_copy.location.apply(lambda x: re.sub(r"[^a-zA-Z:.'!?,#@_\s]+", '', str(x), flags = re.MULTILINE))
df_test_copy.text.values

array(['Crptotech tsunami and banks.\r\n  #Banking #tech #bitcoing #blockchain',
       "I'm that traumatised that I can't even spell properly! Excuse the typos!",
       '@foxnewsvideo @AIIAmericanGirI @ANHQDC So ... where are the rioters looters and burning buildings????  WHITE LIVES MATTER!!!!!!',
       ...,
       'Eruption of Indonesian volcano sparks transport chaos: In this picture done from video Mount Raung in_  ?',
       'Never let fear get in the way of achieving your dreams! #deltachildren #instaquote #quoteoftheday #Disney #WaltDisney ',
       'wowo  Nigerian refugees repatriated from Cameroon'], dtype=object)

In [None]:
df_train_copy_9

| Unnamed: 0 | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 3738 | destroyed | USA | Black Eye : A space battle occurred at Star O ... | 0 |
| 1 | 853 | bioterror | nolocation | #world FedEx no longer to transport bioterror ... | 0 |
| 2 | 10540 | windstorm | Palm Beach County, FL | Reality Training: Train falls off elevated tra... | 1 |
| 3 | 5988 | hazardous | USA | #Taiwan Grace: expect that large rocks trees m... | 1 |
| 4 | 6328 | hostage | Australia | New ISIS Video: ISIS Threatens to Behead Croat... | 1 |
| ... | ... | ... | ... | ... | ... |
| 6466 | 4377 | earthquake | ARGENTINA | #Earthquake #Sismo M . km E of Anchorage Alas... | 1 |
| 6467 | 3408 | derail | nolocation | @EmiiliexIrwin Totally agree.She is and know ... | 0 |
| 6468 | 9794 | trapped | nolocation | Hollywood Movie About Trapped Miners Released ... | 1 |
| 6469 | 10344 | weapons | BeirutToronto | Friendly reminder that the only country to eve... | 1 |


### Dealing with Abbreviations

Tweets by nature contain a lot of abbreviations, so we expand them.

In [None]:

def clean_tw(tweet):

    #correct some acronyms while we are at it
    tweet = re.sub(r"tnwx", "Tennessee Weather", tweet)
    tweet = re.sub(r"azwx", "Arizona Weather", tweet)  
    tweet = re.sub(r"alwx", "Alabama Weather", tweet)
    tweet = re.sub(r"wordpressdotcom", "wordpress", tweet)      
    tweet = re.sub(r"gawx", "Georgia Weather", tweet)  
    tweet = re.sub(r"scwx", "South Carolina Weather", tweet)  
    tweet = re.sub(r"cawx", "California Weather", tweet)
    tweet = re.sub(r"usNWSgov", "United States National Weather Service", tweet) 
    tweet = re.sub(r"MH370", "Malaysia Airlines Flight 370", tweet)
    tweet = re.sub(r"okwx", "Oklahoma City Weather", tweet)
    tweet = re.sub(r"arwx", "Arkansas Weather", tweet)  
    tweet = re.sub(r"lmao", "laughing my ass off", tweet)  
    tweet = re.sub(r"amirite", "am I right", tweet)
    
    #and some typos/abbreviations
    tweet = re.sub(r"w/e", "whatever", tweet)
    tweet = re.sub(r"w/", "with", tweet)
    tweet = re.sub(r"USAgov", "USA government", tweet)
    tweet = re.sub(r"recentlu", "recently", tweet)
    tweet = re.sub(r"Ph0tos", "Photos", tweet)
    tweet = re.sub(r"exp0sed", "exposed", tweet)
    tweet = re.sub(r"<3", "love", tweet)
    tweet = re.sub(r"amageddon", "armageddon", tweet)
    tweet = re.sub(r"Trfc", "Traffic", tweet)
    tweet = re.sub(r"WindStorm", "Wind Storm", tweet)
    tweet = re.sub(r"16yr", "16 year", tweet)
    tweet = re.sub(r"TRAUMATISED", "traumatized", tweet)
    
    #hashtags and usernames
    tweet = re.sub(r"IranDeal", "Iran Deal", tweet)
    tweet = re.sub(r"ArianaGrande", "Ariana Grande", tweet)
    tweet = re.sub(r"camilacabello97", "camila cabello", tweet) 
    tweet = re.sub(r"RondaRousey", "Ronda Rousey", tweet)     
    tweet = re.sub(r"MTVHottest", "MTV Hottest", tweet)
    tweet = re.sub(r"TrapMusic", "Trap Music", tweet)
    tweet = re.sub(r"ProphetMuhammad", "Prophet Muhammad", tweet)
    tweet = re.sub(r"PantherAttack", "Panther Attack", tweet)
    tweet = re.sub(r"StrategicPatience", "Strategic Patience", tweet)
    tweet = re.sub(r"socialnews", "social news", tweet)
    tweet = re.sub(r"IDPs:", "Internally Displaced People :", tweet)
    tweet = re.sub(r"ArtistsUnited", "Artists United", tweet)
    tweet = re.sub(r"ClaytonBryant", "Clayton Bryant", tweet)
    tweet = re.sub(r"jimmyfallon", "jimmy fallon", tweet)
    tweet = re.sub(r"justinbieber", "justin bieber", tweet)  
    tweet = re.sub(r"Time2015", "Time 2015", tweet)
    tweet = re.sub(r"djicemoon", "dj icemoon", tweet)
    tweet = re.sub(r"LivingSafely", "Living Safely", tweet)
    tweet = re.sub(r"FIFA16", "Fifa 2016", tweet)
    tweet = re.sub(r"thisiswhywecanthavenicethings", "this is why we cannot have nice things", tweet)
    tweet = re.sub(r"bbcnews", "bbc news", tweet)
    tweet = re.sub(r"UndergroundRailraod", "Underground Railroad", tweet)
    tweet = re.sub(r"c4news", "c4 news", tweet)
    tweet = re.sub(r"MUDSLIDE", "mudslide", tweet)
    tweet = re.sub(r"NoSurrender", "No Surrender", tweet)
    tweet = re.sub(r"NotExplained", "Not Explained", tweet)
    tweet = re.sub(r"greatbritishbakeoff", "great british bake off", tweet)
    tweet = re.sub(r"LondonFire", "London Fire", tweet)
    tweet = re.sub(r"KOTAWeather", "KOTA Weather", tweet)
    tweet = re.sub(r"LuchaUnderground", "Lucha Underground", tweet)
    tweet = re.sub(r"KOIN6News", "KOIN 6 News", tweet)
    tweet = re.sub(r"LiveOnK2", "Live On K2", tweet)
    tweet = re.sub(r"9NewsGoldCoast", "9 News Gold Coast", tweet)
    tweet = re.sub(r"nikeplus", "nike plus", tweet)
    tweet = re.sub(r"david_cameron", "David Cameron", tweet)
    tweet = re.sub(r"peterjukes", "Peter Jukes", tweet)
    tweet = re.sub(r"MikeParrActor", "Michael Parr", tweet)
    tweet = re.sub(r"4PlayThursdays", "Foreplay Thursdays", tweet)
    tweet = re.sub(r"TGF2015", "Tontitown Grape Festival", tweet)
    tweet = re.sub(r"realmandyrain", "Mandy Rain", tweet)
    tweet = re.sub(r"GraysonDolan", "Grayson Dolan", tweet)
    tweet = re.sub(r"ApolloBrown", "Apollo Brown", tweet)
    tweet = re.sub(r"saddlebrooke", "Saddlebrooke", tweet)
    tweet = re.sub(r"TontitownGrape", "Tontitown Grape", tweet)
    tweet = re.sub(r"AbbsWinston", "Abbs Winston", tweet)
    tweet = re.sub(r"ShaunKing", "Shaun King", tweet)
    tweet = re.sub(r"MeekMill", "Meek Mill", tweet)
    tweet = re.sub(r"TornadoGiveaway", "Tornado Giveaway", tweet)
    tweet = re.sub(r"GRupdates", "GR updates", tweet)
    tweet = re.sub(r"SouthDowns", "South Downs", tweet)
    tweet = re.sub(r"braininjury", "brain injury", tweet)
    tweet = re.sub(r"auspol", "Australian politics", tweet)
    tweet = re.sub(r"PlannedParenthood", "Planned Parenthood", tweet)
    tweet = re.sub(r"calgaryweather", "Calgary Weather", tweet)
    tweet = re.sub(r"weallheartonedirection", "we all heart one direction", tweet)
    tweet = re.sub(r"edsheeran", "Ed Sheeran", tweet)
    tweet = re.sub(r"TrueHeroes", "True Heroes", tweet)
    tweet = re.sub(r"ComplexMag", "Complex Magazine", tweet)
    tweet = re.sub(r"TheAdvocateMag", "The Advocate Magazine", tweet)
    tweet = re.sub(r"CityofCalgary", "City of Calgary", tweet)
    tweet = re.sub(r"EbolaOutbreak", "Ebola Outbreak", tweet)
    tweet = re.sub(r"SummerFate", "Summer Fate", tweet)
    tweet = re.sub(r"RAmag", "Royal Academy Magazine", tweet)
    tweet = re.sub(r"offers2go", "offers to go", tweet)
    tweet = re.sub(r"ModiMinistry", "Modi Ministry", tweet)
    tweet = re.sub(r"TAXIWAYS", "taxi ways", tweet)
    tweet = re.sub(r"Calum5SOS", "Calum Hood", tweet)
    tweet = re.sub(r"JamesMelville", "James Melville", tweet)
    tweet = re.sub(r"JamaicaObserver", "Jamaica Observer", tweet)
    tweet = re.sub(r"TweetLikeItsSeptember11th2001", "Tweet like it is september 11th 2001", tweet)
    tweet = re.sub(r"cbplawyers", "cbp lawyers", tweet)
    tweet = re.sub(r"fewmoretweets", "few more tweets", tweet)
    tweet = re.sub(r"BlackLivesMatter", "Black Lives Matter", tweet)
    tweet = re.sub(r"NASAHurricane", "NASA Hurricane", tweet)
    tweet = re.sub(r"onlinecommunities", "online communities", tweet)
    tweet = re.sub(r"humanconsumption", "human consumption", tweet)
    tweet = re.sub(r"Typhoon-Devastated", "Typhoon Devastated", tweet)
    tweet = re.sub(r"Meat-Loving", "Meat Loving", tweet)
    tweet = re.sub(r"facialabuse", "facial abuse", tweet)
    tweet = re.sub(r"LakeCounty", "Lake County", tweet)
    tweet = re.sub(r"BeingAuthor", "Being Author", tweet)
    tweet = re.sub(r"withheavenly", "with heavenly", tweet)
    tweet = re.sub(r"thankU", "thank you", tweet)
    tweet = re.sub(r"iTunesMusic", "iTunes Music", tweet)
    tweet = re.sub(r"OffensiveContent", "Offensive Content", tweet)
    tweet = re.sub(r"WorstSummerJob", "Worst Summer Job", tweet)
    tweet = re.sub(r"HarryBeCareful", "Harry Be Careful", tweet)
    tweet = re.sub(r"NASASolarSystem", "NASA Solar System", tweet)
    tweet = re.sub(r"animalrescue", "animal rescue", tweet)
    tweet = re.sub(r"KurtSchlichter", "Kurt Schlichter", tweet)
    tweet = re.sub(r"Throwingknifes", "Throwing knives", tweet)
    tweet = re.sub(r"GodsLove", "God's Love", tweet)
    tweet = re.sub(r"bookboost", "book boost", tweet)
    tweet = re.sub(r"ibooklove", "I book love", tweet)
    tweet = re.sub(r"NestleIndia", "Nestle India", tweet)
    tweet = re.sub(r"realDonaldTrump", "Donald Trump", tweet)
    tweet = re.sub(r"DavidVonderhaar", "David Vonderhaar", tweet)
    tweet = re.sub(r"CecilTheLion", "Cecil The Lion", tweet)
    tweet = re.sub(r"weathernetwork", "weather network", tweet)
    tweet = re.sub(r"GOPDebate", "GOP Debate", tweet)
    tweet = re.sub(r"RickPerry", "Rick Perry", tweet)
    tweet = re.sub(r"frontpage", "front page", tweet)
    tweet = re.sub(r"NewsInTweets", "News In Tweets", tweet)
    tweet = re.sub(r"ViralSpell", "Viral Spell", tweet)
    tweet = re.sub(r"til_now", "until now", tweet)
    tweet = re.sub(r"volcanoinRussia", "volcano in Russia", tweet)
    tweet = re.sub(r"ZippedNews", "Zipped News", tweet)
    tweet = re.sub(r"MicheleBachman", "Michele Bachman", tweet)
    tweet = re.sub(r"53inch", "53 inch", tweet)
    tweet = re.sub(r"KerrickTrial", "Kerrick Trial", tweet)
    tweet = re.sub(r"abstorm", "Alberta Storm", tweet)
    tweet = re.sub(r"Beyhive", "Beyonce hive", tweet)
    tweet = re.sub(r"RockyFire", "Rocky Fire", tweet)
    tweet = re.sub(r"Listen/Buy", "Listen / Buy", tweet)
    tweet = re.sub(r"ENGvAUS", "England vs Australia", tweet)
    tweet = re.sub(r"TheStrain", "The Strain", tweet)
    tweet = re.sub(r"bioterror", "bio terror", tweet)
    tweet = re.sub(r"\btranspo\b", "transport", tweet)  # word boundaries: "transport" must not become "transportrt"
    tweet = re.sub(r"ScottWalker", "Scott Walker", tweet)

    return tweet 


df_train_copy_9.text = df_train_copy_9.text.apply(clean_tw) 
df_test_copy.text = df_test_copy.text.apply(clean_tw) 
df_test_copy.text

0       Crptotech tsunami and banks.\r\n  #Banking #te...
1       I'm that traumatised that I can't even spell p...
2       @foxnewsvideo @AIIAmericanGirI @ANHQDC So ... ...
3       Me watching Law amp Order IB: @sauldale Vine b...
4                       Papi absolutely crushed that ball
                              ...                        
1137    @ItsQueenBaby I'm at work it's a bunch of ppl ...
1138    #?? #?? #??? #??? Suicide bomber kills  in Sau...
1139    Eruption of Indonesian volcano sparks transpor...
1140    Never let fear get in the way of achieving you...
1141    wowo  Nigerian refugees repatriated from Cameroon
Name: text, Length: 1142, dtype: object
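
Plain substring patterns can also fire inside longer words. A possible alternative, sketched here under the assumption that a lookup table is acceptable, is a dictionary matched on word boundaries; `EXPANSIONS` and `expand` are hypothetical names and only three entries from `clean_tw` are reproduced:

```python
import re

# A few entries taken from clean_tw; the dict layout is an assumption,
# not how the notebook implements the expansion.
EXPANSIONS = {
    "transpo": "transport",
    "lmao": "laughing my ass off",
    "TheStrain": "The Strain",
}

# One alternation, longest keys first, anchored on word boundaries.
_keys = sorted(EXPANSIONS, key=len, reverse=True)
_pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in _keys) + r")\b")

def expand(tweet):
    """Replace whole-word occurrences of known abbreviations."""
    return _pattern.sub(lambda m: EXPANSIONS[m.group(1)], tweet)

# Unanchored substitution also rewrites the inside of "transport".
naive = re.sub(r"transpo", "transport", "transport chaos")
safe = expand("transport chaos, transpo delays #TheStrain lmao")
print(naive)
print(safe)
```

The table form also makes duplicates impossible, since a dict holds each key once, and it runs a single regex pass per tweet instead of one per entry.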

### Ekphrasis: Dealing with Social Text

Ekphrasis is a tool for preprocessing text from social networks such as Twitter or Facebook. It unpacks word contractions and performs word segmentation and spell correction using word statistics from a Twitter corpus. It also annotates hashtags and replaces emoticons with meaningful expressions.


https://github.com/cbaziotis/ekphrasis

In [None]:
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from ekphrasis.dicts.emoticons import emoticons
import nltk


text_processor = TextPreProcessor(
    # terms that will be normalized
    normalize=['email', 'percent', 'money', 'phone', 'url', 'user',
        'time', 'date', 'number'],
    # terms that will be annotated
    annotate={"hashtag","elongated", "repeated",
        'emphasis', 'censored'},
    fix_html=True,  # fix HTML tokens
    
    # corpus from which the word statistics are going to be used 
    # for word segmentation 
    segmenter="twitter", 
    
    # corpus from which the word statistics are going to be used 
    # for spell correction
    corrector="twitter", 
    
    unpack_hashtags=True,  # perform word segmentation on hashtags
    unpack_contractions=True,  # Unpack contractions (can't -> can not)
    spell_correct_elong=False,  # spell correction for elongated words
    
    # select a tokenizer. You can use SocialTokenizer, or pass your own
    # the tokenizer, should take as input a string and return a list of tokens
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
    
    # list of dictionaries, for replacing tokens extracted from the text,
    # with other expressions. You can pass more than one dictionaries.
    dicts=[emoticons]
)

def preprocess(text):
    # Tokenize and annotate one tweet, then join the tokens back into a string
    return " ".join(text_processor.pre_process_doc(text))

# Assign whole columns at once; element-wise df['text'].loc[i] = ... is
# chained indexing and triggers pandas' SettingWithCopyWarning
df_train_copy_9['text'] = df_train_copy_9['text'].apply(preprocess)
df_test_copy['text'] = df_test_copy['text'].apply(preprocess)
df_test_copy['text'].values


Word statistics files not found!
Downloading... done!
Unpacking... done!
Reading twitter - 1grams ...
generating cache file for faster loading...
reading ngrams /root/.ekphrasis/stats/twitter/counts_1grams.txt
Reading twitter - 2grams ...
generating cache file for faster loading...
reading ngrams /root/.ekphrasis/stats/twitter/counts_2grams.txt
Reading twitter - 1grams ...




array(['crptotech tsunami and banks . <hashtag> banking </hashtag> <hashtag> tech </hashtag> <hashtag> bitcoin g </hashtag> <hashtag> blockchain </hashtag>',
       'i am that traumatised that i can not even spell properly ! excuse the typos !',
       '<user> <user> <user> so . <repeated> where are the rioters looters and burning buildings ? <repeated> white lives matter ! <repeated>',
       ...,
       'eruption of indonesian volcano sparks transportrt chaos : in this picture done from video mount raung in_ ?',
       'never let fear get in the way of achieving your dreams ! <hashtag> delta children </hashtag> <hashtag> insta quote </hashtag> <hashtag> quote of the day </hashtag> <hashtag> disney </hashtag> <hashtag> walt disney </hashtag>',
       'wowo nigerian refugees repatriated from cameroon'], dtype=object)
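
To make the tags in this output easier to read, here is a rough regex imitation of three of the steps: user normalization, hashtag annotation and repeated-punctuation annotation. This is not the ekphrasis implementation. `annotate` is a hypothetical helper, and it does not tokenize, segment hashtags or correct spelling:

```python
import re

def annotate(text):
    """Rough regex imitation of three annotation steps (not ekphrasis itself)."""
    text = re.sub(r"@\w+", "<user>", text)                      # normalize mentions
    text = re.sub(r"#(\w+)", r"<hashtag> \1 </hashtag>", text)  # wrap hashtags
    text = re.sub(r"([!?.])\1+", r"\1 <repeated>", text)        # mark repeated punctuation
    return text.lower()

sample = "@foxnewsvideo So ... where are the rioters???? #Disney"  # adapted sample
annotated = annotate(sample)
print(annotated)
```

The real library additionally splits punctuation into separate tokens and segments hashtags such as `#quoteoftheday` into words, which simple patterns like these cannot do.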

In [None]:
df_train_copy_9.to_csv('./cleaned_data.csv')
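
By default `to_csv` writes the DataFrame index as an unnamed first column, and reading such a file back produces a column named `Unnamed: 0`. A minimal sketch on a made-up two-row frame, in case the LastVersion notebook does not need that column:

```python
import io
import pandas as pd

toy = pd.DataFrame({"id": [3738, 853], "target": [0, 0]})  # made-up rows

# Default: the index is written as an unnamed first column.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False: only the real columns are written.
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Whether to pass `index=False` here depends on what the downstream notebook expects, so treat this as an option rather than a required change.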