#### Context
This is the Sentiment140 dataset. It contains 1,600,000 tweets extracted using the Twitter API. The tweets have been annotated (0 = negative, 4 = positive) and can be used to detect sentiment.

#### Content
It contains the following 6 fields:

target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)

ids: the id of the tweet (2087)

date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)

flag: the query (lyx). If there is no query, then this value is NO_QUERY.

user: the user that tweeted (robotickilldozr)

text: the text of the tweet (Lyx is cool)

#### Acknowledgements
The official link regarding the dataset with resources about how it was generated is here
The official paper detailing the approach is here

*Citation:* Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.

In [26]:
import pandas as pd
import re
import string
import warnings
warnings.filterwarnings('ignore')

In [9]:
df=pd.read_csv('twitter.csv',encoding='ISO-8859-1',names=['target','ids','dates','flag','user','text'])

In [10]:
df.head()

Unnamed: 0,target,ids,dates,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   target  1600000 non-null  int64 
 1   ids     1600000 non-null  int64 
 2   dates   1600000 non-null  object
 3   flag    1600000 non-null  object
 4   user    1600000 non-null  object
 5   text    1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


In [12]:
df['target'].unique()

array([0, 4], dtype=int64)

In [14]:
df['target']=df['target'].replace(4,1)
df['target'].unique()

array([0, 1], dtype=int64)

In [15]:
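# .copy() makes an independent frame, so later column assignments do not act on a view of df
dataset=df[['text','target']].copy()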

In [16]:
dataset.head()

Unnamed: 0,text,target
0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",0
1,is upset that he can't update his Facebook by ...,0
2,@Kenichan I dived many times for the ball. Man...,0
3,my whole body feels itchy and like its on fire,0
4,"@nationwideclass no, it's not behaving at all....",0


#### Step 1: Remove URLs

In [18]:
import re

In [23]:
def remove_url(text):
    pattern=re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'',text)

In [24]:
remove_url('Hello world https://drive.google.com/drive/folders/121ssib0bpfN-92VJYhQmPH-4TmqYeZC8')

'Hello world '

In [27]:
dataset['text']=dataset['text'].apply(lambda x:remove_url(x))
dataset.head()

Unnamed: 0,text,target
0,"@switchfoot - Awww, that's a bummer. You sho...",0
1,is upset that he can't update his Facebook by ...,0
2,@Kenichan I dived many times for the ball. Man...,0
3,my whole body feels itchy and like its on fire,0
4,"@nationwideclass no, it's not behaving at all....",0
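Removing a URL leaves the surrounding whitespace behind, as the trailing space in `'Hello world '` shows. The following is a minimal sketch of one way to tidy that up, assuming that collapsing runs of whitespace is acceptable for this task. `remove_url_clean` and `URL_PATTERN` are hypothetical names that do not appear in the notebook.

```python
import re

# Compile once at module level so the pattern is not rebuilt on every call
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def remove_url_clean(text):
    """Remove URLs, then collapse leftover whitespace (hypothetical helper)."""
    without_urls = URL_PATTERN.sub('', text)
    # split() with no arguments drops leading, trailing and repeated whitespace
    return ' '.join(without_urls.split())

print(remove_url_clean('Hello world https://example.com/page and more'))
# prints: Hello world and more
```

Whether to normalise whitespace at this step or later is a design choice; the chat word step further down also rejoins tokens with single spaces.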


#### Step 2: Remove `<Tags>`

In [28]:
def remove_tags(text):
    pattern=re.compile(r'<.*?>')
    
    return pattern.sub(r'',text)

In [29]:
remove_tags('<p>Hi,<br>My name is Ubaid</p>')

'Hi,My name is Ubaid'

In [30]:
dataset['text']=dataset['text'].apply(lambda x:remove_tags(x))
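One caveat with the pattern `<.*?>` is that it removes anything between a `<` and the next `>`, which may include ordinary tweet text such as `<3`. The sketch below compares it with a stricter pattern that only matches when a letter follows the opening bracket. The stricter pattern and the name `remove_tags_strict` are assumptions for illustration, not part of the notebook, and the pattern is not a full HTML parser.

```python
import re

LOOSE_TAG = re.compile(r'<.*?>')               # pattern used in the notebook
STRICT_TAG = re.compile(r'</?[A-Za-z][^>]*>')  # assumed stricter alternative

def remove_tags_strict(text):
    """Remove only spans that look like HTML tags (hypothetical helper)."""
    return STRICT_TAG.sub('', text)

sample = 'i <3 you and 5 > 4'
print(repr(LOOSE_TAG.sub('', sample)))    # 'i  4' (text between < and > is lost)
print(repr(remove_tags_strict(sample)))   # 'i <3 you and 5 > 4' (unchanged)
print(repr(remove_tags_strict('<p>Hi,<br>My name is Ubaid</p>')))  # 'Hi,My name is Ubaid'
```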

#### Step 3: Remove Punctuation

In [31]:
import string
string.punctuation


'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [32]:
punc=string.punctuation

def remove_punc(text):
    return text.translate(str.maketrans('','',punc))

In [33]:
remove_punc('Hi! My name is Ubaid. Shah.')

'Hi My name is Ubaid Shah'

In [34]:
dataset['text']=dataset['text'].apply(lambda x: remove_punc(x))

In [35]:
dataset.head()

Unnamed: 0,text,target
0,switchfoot Awww thats a bummer You shoulda ...,0
1,is upset that he cant update his Facebook by t...,0
2,Kenichan I dived many times for the ball Manag...,0
3,my whole body feels itchy and like its on fire,0
4,nationwideclass no its not behaving at all im ...,0
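`remove_punc` rebuilds the translation table on every call. A minimal sketch of the same idea with the table built once is shown below; `PUNC_TABLE` and `remove_punc_fast` are hypothetical names. The example also illustrates that `@` and apostrophes are stripped, so mentions lose their marker and "can't" becomes "cant", which matches the output above.

```python
import string

# Build the translation table once instead of on every call
PUNC_TABLE = str.maketrans('', '', string.punctuation)

def remove_punc_fast(text):
    """Delete every character listed in string.punctuation (hypothetical helper)."""
    return text.translate(PUNC_TABLE)

print(remove_punc_fast("@user can't wait!"))
# prints: user cant wait
```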


#### Step 4: Chat Word Treatment

In [38]:
slang='slang.txt'

In [39]:
with open(slang,'r') as f:
    lines=f.readlines()

In [46]:
lines[:3]

['AFAIK=As Far As I Know\n',
 'AFK=Away From Keyboard\n',
 'ASAP=As Soon As Possible\n']

In [42]:
lines[0].split('=')

['AFAIK', 'As Far As I Know\n']

In [43]:
lines[0].split('=')[0]

'AFAIK'

In [45]:
lines[0].split('=')[1][:-1]

'As Far As I Know'

In [52]:
chat_words=dict()

for line in lines:
    # Split on the first '=' only, and strip the newline without cutting a real character
    key,value=line.rstrip('\n').split('=',1)
    chat_words[key]=value

In [53]:
chat_words

{'AFAIK': 'As Far As I Know',
 'AFK': 'Away From Keyboard',
 'ASAP': 'As Soon As Possible',
 'ATK': 'At The Keyboard',
 'ATM': 'At The Moment',
 'A3': 'Anytime, Anywhere, Anyplace',
 'BAK': 'Back At Keyboard',
 'BBL': 'Be Back Later',
 'BBS': 'Be Back Soon',
 'BFN': 'Bye For Now',
 'B4N': 'Bye For Now',
 'BRB': 'Be Right Back',
 'BRT': 'Be Right There',
 'BTW': 'By The Way',
 'B4': 'Before',
 'CU': 'See You',
 'CUL8R': 'See You Later',
 'CYA': 'See You',
 'FAQ': 'Frequently Asked Questions',
 'FC': 'Fingers Crossed',
 'FWIW': "For What It's Worth",
 'FYI': 'For Your Information',
 'GAL': 'Get A Life',
 'GG': 'Good Game',
 'GN': 'Good Night',
 'GMTA': 'Great Minds Think Alike',
 'GR8': 'Great!',
 'G9': 'Genius',
 'IC': 'I See',
 'ICQ': 'I Seek you (also a chat program)',
 'ILU': 'ILU: I Love You',
 'IMHO': 'In My Honest/Humble Opinion',
 'IMO': 'In My Opinion',
 'IOW': 'In Other Words',
 'IRL': 'In Real Life',
 'KISS': 'Keep It Simple, Stupid',
 'LDR': 'Long Distance Relationship',
 'LM

In [63]:
dataset['text'].iloc[7].split()

['LOLTrish',
 'hey',
 'long',
 'time',
 'no',
 'see',
 'Yes',
 'Rains',
 'a',
 'bit',
 'only',
 'a',
 'bit',
 'LOL',
 'Im',
 'fine',
 'thanks',
 'hows',
 'you']

In [69]:
def chat_word_remove(text):
    new_text=[]
    for word in text.split():
        if word in chat_words:
            new_text.append(chat_words[word])
        else:
            new_text.append(word)
    return " ".join(new_text)

In [70]:
chat_word_remove(dataset['text'].iloc[7])

'LOLTrish hey long time no see Yes Rains a bit only a bit Laughing out loud Im fine thanks hows you'
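The lookup in `chat_word_remove` is exact and case-sensitive, so `LOL` is expanded while `lol` or a fused token such as `LOLTrish` is left as is. The sketch below shows a case-insensitive variant, assuming the dictionary keys are upper case as in the output above. `expand_chat_words` and `sample_chat_words` are hypothetical names, and the two entries are a small stand-in for the full dictionary.

```python
# Tiny stand-in for the dictionary built from slang.txt (two entries only)
sample_chat_words = {'LOL': 'Laughing out loud', 'BRB': 'Be Right Back'}

def expand_chat_words(text, mapping):
    """Expand chat abbreviations, matching tokens case-insensitively."""
    expanded = []
    for word in text.split():
        # Keys are upper case, so normalise the token before the lookup
        expanded.append(mapping.get(word.upper(), word))
    return ' '.join(expanded)

print(expand_chat_words('lol brb LOLTrish', sample_chat_words))
# prints: Laughing out loud Be Right Back LOLTrish
```

Case-insensitive matching is a trade-off: it catches lower-case slang, but it may also expand ordinary words that happen to match a key, such as "kiss" or "atm".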

In [72]:
dataset['text'].iloc[7]

'LOLTrish hey  long time no see Yes Rains a bit only a bit  LOL  Im fine thanks  hows you '

In [73]:
dataset['text']=dataset['text'].apply(lambda x: chat_word_remove(x))

In [74]:
dataset['text'].iloc[7]

'LOLTrish hey long time no see Yes Rains a bit only a bit Laughing out loud Im fine thanks hows you'
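The four steps so far can be chained into a single function. This is a minimal sketch under the assumption that the steps run in the same order as in the notebook; the order matters, because removing punctuation first would break the `://` that the URL pattern relies on. `preprocess` and `demo_words` are hypothetical names.

```python
import re
import string

URL_RE = re.compile(r'https?://\S+|www\.\S+')
TAG_RE = re.compile(r'<.*?>')
PUNC_TABLE = str.maketrans('', '', string.punctuation)

def preprocess(text, chat_words):
    """Apply steps 1 to 4 in order (hypothetical helper)."""
    text = URL_RE.sub('', text)        # Step 1: remove URLs
    text = TAG_RE.sub('', text)        # Step 2: remove tags
    text = text.translate(PUNC_TABLE)  # Step 3: remove punctuation
    # Step 4: expand chat words; rejoining also collapses extra spaces
    return ' '.join(chat_words.get(w, w) for w in text.split())

demo_words = {'LOL': 'Laughing out loud'}  # stand-in for the full dictionary
print(preprocess('<b>LOL</b> see http://twitpic.com/2y1zl, nice!', demo_words))
# prints: Laughing out loud see nice
```

On the full data this could be applied in one pass with something like `dataset['text'].apply(preprocess, chat_words=chat_words)`, instead of one `apply` per step.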