# Data Preprocessing

## Twitter data

In [48]:
# libraries
import pandas as pd
import numpy as np
import re

In [49]:
df = pd.read_csv("elonmusk_tweets.csv")
df.head()

Unnamed: 0,id,created_at,text
0,849636868052275200,2017-04-05 14:56:29,b'And so the robots spared humanity ... https://t.co/v7JUJQWfCv'
1,848988730585096192,2017-04-03 20:01:01,"b""@ForIn2020 @waltmossberg @mims @defcon_5 Exactly. Tesla is absurdly overvalued if based on the past, but that's irr\xe2\x80\xa6 https://t.co/qQcTqkzgMl"""
2,848943072423497728,2017-04-03 16:59:35,"b'@waltmossberg @mims @defcon_5 Et tu, Walt?'"
3,848935705057280001,2017-04-03 16:30:19,b'Stormy weather in Shortville ...'
4,848416049573658624,2017-04-02 06:05:23,"b""@DaveLeeBBC @verge Coal is dying due to nat gas fracking. It's basically dead."""


In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2819 entries, 0 to 2818
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          2819 non-null   int64 
 1   created_at  2819 non-null   object
 2   text        2819 non-null   object
dtypes: int64(1), object(2)
memory usage: 66.2+ KB


### The objective is to preprocess the data so that it can be used to train a model suited to the task at hand.

In [51]:
pd.set_option('display.colheader_justify', 'left') 
pd.set_option('display.max_colwidth', None)

In [52]:
print(df['text'][0:5])

0                                                                                              b'And so the robots spared humanity ... https://t.co/v7JUJQWfCv'
1    b"@ForIn2020 @waltmossberg @mims @defcon_5 Exactly. Tesla is absurdly overvalued if based on the past, but that's irr\xe2\x80\xa6 https://t.co/qQcTqkzgMl"
2                                                                                                                 b'@waltmossberg @mims @defcon_5 Et tu, Walt?'
3                                                                                                                           b'Stormy weather in Shortville ...'
4                                                                             b"@DaveLeeBBC @verge Coal is dying due to nat gas fracking. It's basically dead."
Name: text, dtype: object


### The output shows noise in the data (the b'...' wrappers, @mentions, URLs and escaped byte sequences such as \xe2\x80\xa6), which makes it harder for a model to understand the context of each tweet. This matters especially for sentiment analysis, i.e. the process of analysing digital text to determine whether the tone of a message is positive, negative or neutral.

# Data cleaning

### For sentiment analysis of Twitter data, we should remove URLs, e-mail addresses and hashtags. They carry little information when we analyse the text word by word, and removing them also reduces the size of the data.

## Example

In [53]:
p = df['text'][2816]
remove_hp = r"https?://\S+|www\.\S+"

### URLs are complex, and writing a regular expression that covers every format is challenging. The expression above matches HTTP/HTTPS URLs as well as URLs starting with "www."; the | lets it match either pattern in a given text.

In [54]:
# example
df['text'][0]

"b'And so the robots spared humanity ... https://t.co/v7JUJQWfCv'"

In [55]:
x = re.sub(remove_hp, '', df['text'][0])  # replace the hyperlink with an empty string
print(x)

b'And so the robots spared humanity ... 


In [56]:
a = 'hi, please check this link https://xyz.com'
b = 'hi, please check this link http://xyz.com'
c = 'hi, please check this link www.xyz.com'
a1 = re.sub(remove_hp,'',a)
a2 = re.sub(remove_hp,'',b)
a3 = re.sub(remove_hp,'',c)
print('\n',a1,'\n',a2,'\n',a3)


 hi, please check this link  
 hi, please check this link  
 hi, please check this link 


## The examples above show that the regular expression removes each of the given URLs.
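The same pattern can be wrapped in a small helper. This is a minimal sketch, assuming a hypothetical function name `strip_urls`; it also collapses the spaces that the removed URL leaves behind. Note that `\S+` runs up to the next whitespace, so any character attached to the URL, such as a closing quote, is removed along with it.

```python
import re

# Same pattern as remove_hp: http(s) links or anything starting with "www."
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def strip_urls(text):
    """Remove URLs and collapse the whitespace they leave behind (hypothetical helper)."""
    without_urls = URL_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_urls).strip()

print(strip_urls("hi, please check this link https://xyz.com today"))
# The quote glued to the URL is swallowed by \S+ as well
print(strip_urls("see www.xyz.com' for details"))
```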

### Removing hashtags helps the model focus on the content of the text rather than its metadata, which helps when analysing expressed opinions; it also reduces noise. The same goes for @mentions. Removing mentions also keeps the conversation private, especially when the text names a specific individual who did not consent to being part of a sentiment analysis.

### However, the drawbacks are a loss of context and, when the goal is to gauge public opinion on a specific brand, the loss of the brand reference itself.
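If that loss of context is a concern, one possible middle ground is to drop only the `#` symbol and keep the word behind it, while still removing @mentions. The sketch below illustrates the idea; the helper name `soften_tags` is an assumption and is not part of the notebook, which removes both hashtags and mentions outright.

```python
import re

def soften_tags(text):
    """Drop @mentions entirely but keep the word behind each '#' (illustrative helper)."""
    text = re.sub(r"@\w+", "", text)        # remove handles such as @verge
    text = re.sub(r"#(\w+)", r"\1", text)   # "#Tesla" becomes "Tesla"
    return re.sub(r"\s+", " ", text).strip()

print(soften_tags("@verge Loving the new #Tesla update #EV"))
```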

#### Since we do not have specific targets, we will remove these anomalies to reduce noise.

In [57]:
df['text'][1]

'b"@ForIn2020 @waltmossberg @mims @defcon_5 Exactly. Tesla is absurdly overvalued if based on the past, but that\'s irr\\xe2\\x80\\xa6 https://t.co/qQcTqkzgMl"'

In [58]:
regex_hash = r'#\S+|@\S+'
x = re.sub(regex_hash, '', df['text'][1])  # replace hashtags and @mentions with an empty string
print(x)

b"    Exactly. Tesla is absurdly overvalued if based on the past, but that's irr\xe2\x80\xa6 https://t.co/qQcTqkzgMl"


In [59]:
# remove the leading b' or b" and the trailing quote from each string
regex_beginning = r"^b['\"]|[/'\"]$"
p = re.sub(regex_beginning, '', df['text'][3])
print(p)


Stormy weather in Shortville ...


In [60]:
import nltk                                # Python library for NLP
import string                              # for string operations

from nltk.corpus import stopwords          # module for stop words that come with NLTK
from nltk.stem import PorterStemmer        # module for stemming
from nltk.tokenize import TweetTokenizer   # module for tokenizing strings

## Removing punctuation helps simplify the text, improves tokenization and reduces dimensionality.
## However, punctuation can convey sentiment, emphasis or tone, so for sentiment analysis tasks it can be worth preserving certain punctuation marks.
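As a rough illustration of that trade-off, the sketch below drops punctuation tokens but keeps `!` and `?`. Which marks count as sentiment-bearing is an assumption made here for the example, and the helper name `filter_punctuation` is hypothetical.

```python
import string

# Assumption: '!' and '?' are treated as sentiment-bearing and kept; all other punctuation is dropped.
KEEP = {"!", "?"}
DROP = set(string.punctuation) - KEEP

def filter_punctuation(tokens):
    """Remove tokens made up only of the punctuation characters we chose to drop."""
    return [tok for tok in tokens if not all(ch in DROP for ch in tok)]

tokens = ["So", "obtuse", "!", "You", ",", "can't", "speak", "English", "!", "?", "..."]
print(filter_punctuation(tokens))
```

Checking every character of a token, rather than testing the whole token against `string.punctuation`, also catches multi-character tokens such as `...`.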

In [61]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

## Stop words are common words that appear in almost every sentence.
## These words are considered to be of little value in terms of conveying meaningful information about the content of the text. Stop words are typically very frequent and do not contribute much to the overall meaning of a sentence.
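For sentiment analysis there is one caveat: negations such as "not", "no" and "nor" are part of NLTK's English stop word list, and dropping them can flip the meaning of a sentence. A minimal sketch of keeping them, using a small hand-written stop list as a stand-in for the NLTK list:

```python
# Assumption: a small hand-written stop list for illustration only;
# in the notebook this role is played by stopwords.words('english').
stop_words = {"the", "is", "a", "this", "not", "no", "nor"}
negations = {"not", "no", "nor"}

# Keep negations, since removing them can invert the sentiment of a sentence
sentiment_stop_words = stop_words - negations

tokens = ["this", "is", "not", "a", "good", "idea"]
kept = [tok for tok in tokens if tok not in sentiment_stop_words]
print(kept)
```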

In [39]:
# download the stopwords from NLTK
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\yeshw\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [40]:
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
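The three options used here change the tokens noticeably: `preserve_case=False` lowercases the tokens, `strip_handles=True` drops @handles, and `reduce_len=True` shortens runs of three or more repeated characters to three. A short sketch on a made-up sample sentence (the sentence is an assumption, not taken from the dataset):

```python
from nltk.tokenize import TweetTokenizer

# Same options as the tokenizer created in the cell above
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

# Hypothetical sample containing a handle, mixed case and an elongated word
sample = "@verge Coal is dying... SOOOOOO basically dead!"
tokens = tokenizer.tokenize(sample)
print(tokens)
```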

In [42]:
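def preprocess():
    # Tokenize the first five raw tweets; a default TweetTokenizer keeps case and @handles
    df['tokenized_text'] = df['text'][0:5].apply(lambda x: TweetTokenizer().tokenize(x))

preprocess()
df['tokenized_text'][0:5]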

0                                                                                                                         [b'And, so, the, robots, spared, humanity, ..., https://t.co/v7JUJQWfCv, ']
1       [@ForIn2020, @waltmossberg, @mims, @defcon_5, Exactly, ., Tesla, is, absurdly, overvalued, if, based, on, the, past, ,, but, that's, irr, \, xe2, \, x80, \, xa6, https://t.co/qQcTqkzgMl, "]
2                                                                                                                                      [b, ', @waltmossberg, @mims, @defcon_5, Et, tu, ,, Walt, ?, ']
3                                                                                                                                                         [b'Stormy, weather, in, Shortville, ..., ']
4                                                                                           [b, ", @DaveLeeBBC, @verge, Coal, is, dying, due, to, nat, gas, fracking, ., It's, basically, dead, ., "]
          

In [47]:
df

Unnamed: 0,id,created_at,text,tokenized_text
0,849636868052275200,2017-04-05 14:56:29,b'And so the robots spared humanity ... https://t.co/v7JUJQWfCv',"[b'And, so, the, robots, spared, humanity, ..., https://t.co/v7JUJQWfCv, ']"
1,848988730585096192,2017-04-03 20:01:01,"@ForIn2020 @waltmossberg @mims @defcon_5 Exactly. Tesla is absurdly overvalued if based on the past, but that's irr\xe2\x80\xa6 https://t.co/qQcTqkzgMl""","[@ForIn2020, @waltmossberg, @mims, @defcon_5, Exactly, ., Tesla, is, absurdly, overvalued, if, based, on, the, past, ,, but, that's, irr, \, xe2, \, x80, \, xa6, https://t.co/qQcTqkzgMl, ""]"
2,848943072423497728,2017-04-03 16:59:35,"b'@waltmossberg @mims @defcon_5 Et tu, Walt?'","[b, ', @waltmossberg, @mims, @defcon_5, Et, tu, ,, Walt, ?, ']"
3,848935705057280001,2017-04-03 16:30:19,b'Stormy weather in Shortville ...',"[b'Stormy, weather, in, Shortville, ..., ']"
4,848416049573658624,2017-04-02 06:05:23,"b""@DaveLeeBBC @verge Coal is dying due to nat gas fracking. It's basically dead.""","[b, "", @DaveLeeBBC, @verge, Coal, is, dying, due, to, nat, gas, fracking, ., It's, basically, dead, ., ""]"
...,...,...,...,...
2814,142881284019060736,2011-12-03 08:22:07,b'That was a total non sequitur btw',
2815,142880871391838208,2011-12-03 08:20:28,"b'Great Voltaire quote, arguably better than Twain. Hearing news of his own death, Voltaire replied the reports were true, only premature.'",
2816,142188458125963264,2011-12-01 10:29:04,b'I made the volume on the Model S http://t.co/wMCnT53M go to 11. Now I just need to work in a miniature Stonehenge...',
2817,142179928203460608,2011-12-01 09:55:11,"b""Went to Iceland on Sat to ride bumper cars on ice! No, not the country, Vlad's rink in Van Nuys. Awesome family fun :) http://t.co/rBQXJ9IT""",


In [62]:
def clean_text(text):
    # Remove URLs
    text = re.sub(r"https?://\S+|www\.\S+", '', text)
    
    # Remove hashtags and @mentions
    text = re.sub(r'#\S+|@\S+', '', text)
    
    # Remove the leading b' or b" and the trailing quote
    text = re.sub(r"^b['\"]|[/'\"]$", '', text)
    
    return text.strip()

In [96]:
df1 = df['text'][1:10].apply(clean_text)
df1 = pd.DataFrame({'t_text': df1})

In [97]:
df1

Unnamed: 0,t_text
1,"Exactly. Tesla is absurdly overvalued if based on the past, but that's irr\xe2\x80\xa6"
2,"Et tu, Walt?"
3,Stormy weather in Shortville ...
4,Coal is dying due to nat gas fracking. It's basically dead.
5,It's just a helicopter in helicopter's clothing
6,It won't matter
7,Pretty good
8,"Why did we waste so much time developing silly rockets? Damn you, aliens! So obtuse! You have all this crazy tech, but can't speak English!?"
9,Technology breakthrough: turns out chemtrails are actually a message from time-traveling aliens describing the secret of teleportation


In [117]:
#df2 = df1.apply(lambda x: TweetTokenizer().tokenize(x))
#print(df2)
def decode_utf8(column):
    return column.apply(lambda text: text.encode('latin1').decode('utf-8'))

# Apply the decoding function to the 't_text' column
df1['decoded_text'] = decode_utf8(df1['t_text'])

In [118]:
df1

Unnamed: 0,t_text,tokenized_text,decoded_text
1,"Exactly. Tesla is absurdly overvalued if based on the past, but that's irr\xe2\x80\xa6","[Exactly, ., Tesla, is, absurdly, overvalued, if, based, on, the, past, ,, but, that's, irr, \, xe2, \, x80, \, xa6]","Exactly. Tesla is absurdly overvalued if based on the past, but that's irr\xe2\x80\xa6"
2,"Et tu, Walt?","[Et, tu, ,, Walt, ?]","Et tu, Walt?"
3,Stormy weather in Shortville ...,"[Stormy, weather, in, Shortville, ...]",Stormy weather in Shortville ...
4,Coal is dying due to nat gas fracking. It's basically dead.,"[Coal, is, dying, due, to, nat, gas, fracking, ., It's, basically, dead, .]",Coal is dying due to nat gas fracking. It's basically dead.
5,It's just a helicopter in helicopter's clothing,"[It's, just, a, helicopter, in, helicopter's, clothing]",It's just a helicopter in helicopter's clothing
6,It won't matter,"[It, won't, matter]",It won't matter
7,Pretty good,"[Pretty, good]",Pretty good
8,"Why did we waste so much time developing silly rockets? Damn you, aliens! So obtuse! You have all this crazy tech, but can't speak English!?","[Why, did, we, waste, so, much, time, developing, silly, rockets, ?, Damn, you, ,, aliens, !, So, obtuse, !, You, have, all, this, crazy, tech, ,, but, can't, speak, English, !, ?]","Why did we waste so much time developing silly rockets? Damn you, aliens! So obtuse! You have all this crazy tech, but can't speak English!?"
9,Technology breakthrough: turns out chemtrails are actually a message from time-traveling aliens describing the secret of teleportation,"[Technology, breakthrough, :, turns, out, chemtrails, are, actually, a, message, from, time-traveling, aliens, describing, the, secret, of, teleportation]",Technology breakthrough: turns out chemtrails are actually a message from time-traveling aliens describing the secret of teleportation
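In the table above, decoded_text is identical to t_text. The column holds the literal characters `\`, `x`, `e`, `2` and so on rather than the bytes they stand for, so `encode('latin1').decode('utf-8')` has nothing to repair. One possible way around this is to parse the raw value as a Python bytes literal and decode the result. The sketch below assumes every value in the raw text column is a valid literal of the form `b'...'` or `b"..."`; the helper name `decode_bytes_literal` is hypothetical, and it would need to run on the raw column before the wrapper is stripped.

```python
import ast

def decode_bytes_literal(raw):
    """Parse the text "b'...'" as a bytes literal, then decode the bytes as UTF-8."""
    value = ast.literal_eval(raw)  # raises ValueError or SyntaxError on malformed input
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return str(value)

# Raw value as stored in the CSV: the backslashes are literal characters
raw = 'b"that\'s irr\\xe2\\x80\\xa6 https://t.co/qQcTqkzgMl"'
print(decode_bytes_literal(raw))
```

With this approach the three escaped bytes become a single ellipsis character, and the `b'...'` wrapper disappears without a separate regular expression.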


In [112]:
tokenizer = TweetTokenizer()
df1['tokenized_text'] = df1['decoded_text'].apply(lambda text: tokenizer.tokenize(text))

In [114]:
from nltk.tokenize import TweetTokenizer

# Text whose UTF-8 bytes were read as Latin-1 characters
original_text = "Exactly. Tesla is absurdly overvalued if based on the past, but that's irr\xe2\x80\xa6"

# Re-encode as Latin-1 to recover the bytes, then decode them as UTF-8
decoded_text = original_text.encode('latin1').decode('utf-8')

# Tokenize using TweetTokenizer
tokenizer = TweetTokenizer()
tokenized_text = tokenizer.tokenize(decoded_text)

# Display the original and tokenized text
print("Original Text:", original_text)
print("Tokenized Text:", tokenized_text)

Original Text: Exactly. Tesla is absurdly overvalued if based on the past, but that's irrâ¦
Tokenized Text: ['Exactly', '.', 'Tesla', 'is', 'absurdly', 'overvalued', 'if', 'based', 'on', 'the', 'past', ',', 'but', "that's", 'irr', '…']


In [107]:
def decode_utf8(column):
    return column.apply(lambda text: text.encode('latin1').decode('utf-8'))

# Apply the decoding function to the 't_text' column
df1['decoded_text'] = decode_utf8(df1['t_text'])
tokenizer = TweetTokenizer()
df1['tokenized_text'] = df1['decoded_text'].apply(lambda text: tokenizer.tokenize(text))


Unnamed: 0,t_text,tokenized_text,decoded_text
1,"Exactly. Tesla is absurdly overvalued if based on the past, but that's irr\xe2\x80\xa6","[Exactly, ., Tesla, is, absurdly, overvalued, if, based, on, the, past, ,, but, that's, irr, \, xe2, \, x80, \, xa6]","Exactly. Tesla is absurdly overvalued if based on the past, but that's irr\xe2\x80\xa6"
2,"Et tu, Walt?","[Et, tu, ,, Walt, ?]","Et tu, Walt?"
3,Stormy weather in Shortville ...,"[Stormy, weather, in, Shortville, ...]",Stormy weather in Shortville ...
4,Coal is dying due to nat gas fracking. It's basically dead.,"[Coal, is, dying, due, to, nat, gas, fracking, ., It's, basically, dead, .]",Coal is dying due to nat gas fracking. It's basically dead.
5,It's just a helicopter in helicopter's clothing,"[It's, just, a, helicopter, in, helicopter's, clothing]",It's just a helicopter in helicopter's clothing
6,It won't matter,"[It, won't, matter]",It won't matter
7,Pretty good,"[Pretty, good]",Pretty good
8,"Why did we waste so much time developing silly rockets? Damn you, aliens! So obtuse! You have all this crazy tech, but can't speak English!?","[Why, did, we, waste, so, much, time, developing, silly, rockets, ?, Damn, you, ,, aliens, !, So, obtuse, !, You, have, all, this, crazy, tech, ,, but, can't, speak, English, !, ?]","Why did we waste so much time developing silly rockets? Damn you, aliens! So obtuse! You have all this crazy tech, but can't speak English!?"
9,Technology breakthrough: turns out chemtrails are actually a message from time-traveling aliens describing the secret of teleportation,"[Technology, breakthrough, :, turns, out, chemtrails, are, actually, a, message, from, time-traveling, aliens, describing, the, secret, of, teleportation]",Technology breakthrough: turns out chemtrails are actually a message from time-traveling aliens describing the secret of teleportation


In [100]:
original_text = "Exactly. Tesla is absurdly overvalued if based on the past, but that's irr\xe2\x80\xa6"

# Tokenize using TweetTokenizer
tokenizer = TweetTokenizer()
tokenized_text = tokenizer.tokenize(original_text)

# Display the original and tokenized text
print("Original Text:", original_text)
print("Tokenized Text:", tokenized_text)

Original Text: Exactly. Tesla is absurdly overvalued if based on the past, but that's irrâ¦
Tokenized Text: ['Exactly', '.', 'Tesla', 'is', 'absurdly', 'overvalued', 'if', 'based', 'on', 'the', 'past', ',', 'but', "that's", 'irrâ', '\x80', '¦']


In [73]:
stopwords_english = stopwords.words('english') 

print('Stop words\n')
print(stopwords_english)

print('\nPunctuation\n')
print(string.punctuation)

Stop words

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so

In [79]:
# df2 holds only the token lists produced from the cleaned tweets
df2 = pd.DataFrame({'tokenized_text': df1['tokenized_text']})

In [80]:
df2

Unnamed: 0,tokenized_text
1,"[Exactly, ., Tesla, is, absurdly, overvalued, if, based, on, the, past, ,, but, that's, irr, \, xe2, \, x80, \, xa6]"
2,"[Et, tu, ,, Walt, ?]"
3,"[Stormy, weather, in, Shortville, ...]"
4,"[Coal, is, dying, due, to, nat, gas, fracking, ., It's, basically, dead, .]"
5,"[It's, just, a, helicopter, in, helicopter's, clothing]"
6,"[It, won't, matter]"
7,"[Pretty, good]"
8,"[Why, did, we, waste, so, much, time, developing, silly, rockets, ?, Damn, you, ,, aliens, !, So, obtuse, !, You, have, all, this, crazy, tech, ,, but, can't, speak, English, !, ?]"
9,"[Technology, breakthrough, :, turns, out, chemtrails, are, actually, a, message, from, time-traveling, aliens, describing, the, secret, of, teleportation]"


In [87]:
tweets_clean = []
for tokens in df2['tokenized_text']:
    for word in tokens:  # go through every token of every tweet
        # keep the token only if it is neither a stop word nor a punctuation character
        if word not in stopwords_english and word not in string.punctuation:
            tweets_clean.append(word)

print('removed stop words and punctuation:')
print(tweets_clean)

removed stop words and punctuation:
['Exactly', 'Tesla', 'absurdly', 'overvalued', 'based', 'past', "that's", 'irr', 'xe2', 'x80', 'xa6', 'Et', 'tu', 'Walt', 'Stormy', 'weather', 'Shortville', '...', 'Coal', 'dying', 'due', 'nat', 'gas', 'fracking', "It's", 'basically', 'dead', "It's", 'helicopter', "helicopter's", 'clothing', 'It', 'matter', 'Pretty', 'good', 'Why', 'waste', 'much', 'time', 'developing', 'silly', 'rockets', 'Damn', 'aliens', 'So', 'obtuse', 'You', 'crazy', 'tech', "can't", 'speak', 'English', 'Technology', 'breakthrough', 'turns', 'chemtrails', 'actually', 'message', 'time-traveling', 'aliens', 'describing', 'secret', 'teleportation']
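Two things stand out in this output: capitalised stop words such as "It", "Why" and "So" survive because the NLTK list is lowercase, and all tweets are merged into one flat list. A possible variant that compares in lowercase and keeps one token list per tweet is sketched below. The stop list is a small stand-in chosen for the example and the helper name `clean_tokens` is hypothetical.

```python
import string

# Assumption: a tiny illustrative stop list; the notebook uses stopwords.words('english').
stop_words = {"it", "is", "a", "in", "so", "you", "why", "did", "we", "to"}

def clean_tokens(tokens):
    """Drop stop words (case-insensitively) and tokens consisting only of punctuation."""
    cleaned = []
    for tok in tokens:
        if tok.lower() in stop_words:
            continue
        if all(ch in string.punctuation for ch in tok):
            continue
        cleaned.append(tok)
    return cleaned

tweets = [
    ["Stormy", "weather", "in", "Shortville", "..."],
    ["It", "won't", "matter"],
]
# One cleaned token list per tweet instead of a single flat list
print([clean_tokens(t) for t in tweets])
```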
