Due to the varying and unpredictable nature of language used in social media, data pre-processing is essential to assure the data quality. Five techniques widely used in tweet analysis are implemented. 

- Cleaning : words starting with the @ symbol, which refers username. Hypertext markup language (HTML) and a pointer - “#” means hashtags are removed. Emoticons (e.g. 😊, ☹) and un-recognizable UTF-8 encoding forms are converted into human readable words. Converted elongation of words (e.g. cooool, hooooot) to “reducing more than 3 subsequent occurrence of the same repeated character/letter to a sequence of three characters (e.g. coool, hooot)”, which differentiates the regular usage and emphasises the usage of the specific word.


- Slangs/abbreviations Correction: convert the most common twitter slangs or abbreviations into the original form of words using tweet slang dictionary. 


- Tokenisation:  we split each content from the dataset in the corpus with a whitespace to get individual terms, which called token (Grefenstette & Tapanainen, 1994).  Note that short forms like “I’ve”, “can’t” and n-grams created from previous data pre-processing are considered as one word. Other punctuation and special cases are not included in the token list.


- Lower casing: unify entire tokens into lower case for consistency. 


- Stop words Removal: stop words, which can be represented as language specific functional words, like pronouns, prepositions and conjunctions are removed because they don't provide any useful information in text analysis. 

In [1]:
import pandas as pd
import nltk
from nltk.stem import *
from nltk.corpus import sentiwordnet as swn

import re
import numpy as np
import pandas as pd
from pprint import pprint
import csv

# spacy for lemmatization
import spacy

import gensim
import gensim.utils
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

from bs4 import BeautifulSoup
from nltk.tokenize import WordPunctTokenizer


In [3]:
# import dataset 
cDF = pd.read_csv("cDMARD_without_URL_TWINTver2.csv")
bDF = pd.read_csv("bDMARD_without_URL_TWINTver2.csv")

In [98]:
cDF.head()

Unnamed: 0.1,Unnamed: 0,index,date,id,tweet,location,user_id
0,0,1,2018-07-29,1023513962485174273,Antirheumatic agents (disease modifying): memb...,,1277029087
1,1,18,2018-07-19,1019720504821739520,@PaolaGhione_MD Methotrexate is still the Gold...,,152868394
2,2,20,2018-07-15,1018422977937850368,Our desi cow is only being on earth which cont...,,3949709654
3,5,33,2018-07-02,1013455474941808640,Let's do a poll question! Q2: Gold salts can c...,,511178999
4,7,40,2018-06-24,1010584989573009408,It was the first meeting after the war and was...,,33488987


### cleaning 

### slangs/abbreviation correction

In [65]:
def translator(user_string):
    user_string = user_string.split(" ")

    j = 0

    for _str in user_string:
        fileName = "slang2.txt"
        accessMode = "r"

        with open(fileName, accessMode) as myCSVfile:

            # Reading file as CSV with delimiter as "=", so that abbreviation are stored in row[0] and phrases in row[1]

            dataFromFile = csv.reader(myCSVfile, delimiter="=")

            # Removing Special Characters.
            
            _str = re.sub('[^a-zA-Z0-9-_.]', '', _str)
            
            for row in dataFromFile:
                # Check if selected word matches short forms[LHS] in text file.
                if (re.search("\.{1,}", _str)):
                    tmp = re.sub("\.*", "", _str)
                    
                    if tmp.upper() == row[0]:
                   
                    # If match found replace it with its appropriate phrase in text file.
                        user_string[j] = row[1] + "."
                        
                        
                if (re.search("\,{1,}", _str)):
                    tmp = re.sub("\,*", "", _str)
                    
                    if tmp.upper() == row[0]:
                        user_string[j] = row[1] + ","
                        
                
                if _str.upper() == row[0]:
                    # If match found replace it with its appropriate phrase in text file.
                    #print(row[0])
                    #print(row)
                    user_string[j] = row[1]

            myCSVfile.close()

        j = j + 1

    # Replacing commas with spaces for final output.
    return ' '.join(user_string)


In [66]:
# example of translator 
# tmr, tmrw -> tomorrow 
translator("also... tmr tmrw i got bad hangovers on orencia. i'm hoping that won't b the case withactemra..but idk. kids have early swim meet tmrw &i can't miss it")

"also... Tomorrow  Tomorrow  i got bad hangovers on orencia. i'm hoping that won't Back the case withactemra..but I don't know. kids have early swim meet Tomorrow  &i can't miss it"

### tokenisation 

In [4]:
tok = WordPunctTokenizer()

In [7]:
from nltk.tokenize import word_tokenize

def dataCleaning(df):
    
    for i in range(len(df)):
        temp = df.tweet[i]
        
        # remove HTML 
        soup = BeautifulSoup(temp, 'lxml')
        souped = soup.get_text()
        if(temp != souped):
            print(temp)
            
        #remove mention-username 
        souped = re.sub(r'@[A-Za-z0-9\_\-]+','',souped) 
        souped = re.sub("#", " ", souped) #remeove hashtag=letter only 
        
        # replace repeating character more than 3 times to 3 times     
        souped = re.sub(r'([a-zA-Z])\1{2,}', r'\1\1\1', souped)
        
        #emoticon handling
        souped = re.sub("[;|\:|\=|x|X].?[\)|D]{1,}", " smiling face ", souped)
        souped = re.sub("[;|\:|\=|x|X].?[\(|/]{1,}",  " sad face ", souped)
        souped = re.sub("[;|\:|\=|x|X].?[b|p|P]{1,}", " silly face ", souped)
        souped = re.sub("<3{1,}", "with love", souped)
        souped = re.sub(" w ", " with ", souped)
        
        souped = re.sub("[w|W] {0,1}\/ {0,1}[o|O] ", "without ", souped)
        
        souped = re.sub("[w|W] {0,1}\/","with", souped) 
        souped = re.sub("yr", "year", souped.lower())
        souped = re.sub("yrs", "years", souped.lower())
        
        souped = translator(souped)
        souped = re.sub(" {2,}", " ", souped)        
        df.tweet[i] = souped.strip()

    return df 
        

In [121]:
cDF.head()

Unnamed: 0.1,Unnamed: 0,index,date,id,tweet,location,user_id
0,0,1,2018-07-29,1023513962485174273,Antirheumatic agents ( disease modifying ): me...,,1277029087
1,1,18,2018-07-19,1019720504821739520,Methotrexate is still the Gold Standard Treatm...,,152868394
2,2,20,2018-07-15,1018422977937850368,Our desi cow is only being on earth which cont...,,3949709654
3,5,33,2018-07-02,1013455474941808640,Let ' s do a poll question ! Q2 : Gold salts c...,,511178999
4,7,40,2018-06-24,1010584989573009408,It Wait a second the first meeting after the w...,,33488987


In [8]:
dataCleaning(cDF)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


@kshevaun_sings @L_Hizzy Best of luck! It works for a lot of people! I personally took methotrexate for as long as I did so I could get insurance to cover my move to biologics. I would ask your doctor about Benlysta, a biologic medicine specifically for lupus.  <URL>


Unnamed: 0.1,Unnamed: 0,index,date,id,tweet,location,user_id
0,0,1,2018-07-29,1023513962485174273,antirheumatic agents (disease modifying): memb...,,1277029087
1,1,18,2018-07-19,1019720504821739520,methotrexate is still the gold standard treatm...,,152868394
2,2,20,2018-07-15,1018422977937850368,our desi cow is only being on earth which cont...,,3949709654
3,5,33,2018-07-02,1013455474941808640,let's do a poll question! q2: gold salts can c...,,511178999
4,7,40,2018-06-24,1010584989573009408,it was the first meeting after the war and was...,,33488987
5,8,44,2018-06-19,1008828112380092416,"true gold is scarce, give me gold the lord god...",,884958231558328321
6,9,45,2018-06-19,1008732826756251649,[medicine]methotrexate is the csdmard of choic...,,184645342
7,11,47,2018-06-17,1008221880879284224,silver fans are the most precious gold salts t...,,273390607
8,13,61,2018-06-10,1005560145420488704,bitch i might go neon white and get injected w...,,720385476926971908
9,14,62,2018-06-09,1005357127689474049,"some arthritis medications contain gold salts,...",,988760211476402176


In [70]:
cDF.to_csv("cDMARD without Url after data processing ver5.csv")

In [71]:
dataCleaning(bDF)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0.1,Unnamed: 0,index,date,id,tweet,location,user_id
0,7,3695,2018-07-30,1023665625732997120,staffing problems! apparently i shouldn’t have...,,2.956904e+08
1,13,3704,2018-07-30,1023589447240171522,it’s a difficult choice. i was stopped because...,,2.697912e+08
2,14,3705,2018-07-30,1023588362773585921,"#myibdhistory told for 7 years had ibs, dx cro...",,2.956904e+08
3,20,3711,2018-07-29,1023529270914756609,of course. i was on it for eight years. becaus...,,3.303199e+09
4,21,3712,2018-07-29,1023526202580127744,fingers crossed for the infliximab :crossed_fi...,,2.612927e+08
5,22,3713,2018-07-29,1023522245099118592,first time out of the house properly in almost...,,1.663716e+09
6,23,3714,2018-07-29,1023493049505263616,"if it's not researched, how do you knowledge i...",,7.583030e+17
7,25,3717,2018-07-29,1023385292483452928,sorry to hear that gabe. let’s hope that your ...,,4.093497e+08
8,26,3719,2018-07-29,1023336213862932481,2015 symptoms first started.2016 diagnosed wit...,,3.238341e+08
9,27,3720,2018-07-29,1023331773378424832,ok. adalimumab is a game changer for people wi...,,7.583030e+17


In [72]:
#cDF.to_csv("cDMARD without Url after data processing.csv")
bDF.to_csv("bDMARD without URL after data processing ver5.csv")