# Introduction

In this project, I analyze text that I scraped from Twitter in an earlier project: [Twitter Scraping with Python](https://github.com/teguhsam/twitter_scraping_with_PYTHON).

# Import Text

In [1]:
import json

# Load the scraped tweets (a JSON array of strings) into a Python list
with open('tweets_1000.json') as json_file:
    json_text = json.load(json_file)

print(type(json_text))

<class 'list'>


In [2]:
json_text[0:3]

['RT @SBSNews: The Morrison government is set to adopt a technology investment target instead of signing up to a global agreement to achieve…',
 '@ColinCowherd im for steroids and using technology to steal signs',
 "@peachpanini of course we'd have the technology to show them on a reasonable scale via your tv/computer monitor. think of the possibilities"]

# Tokenize Text

Resource: https://machinelearningmastery.com/clean-text-machine-learning-python/

## Tokenize into words

In [21]:
from nltk.tokenize import word_tokenize

## Create an empty list for the tokens

In [45]:
token_words_list = list()

## Tokenize into words

In [46]:
for tweet in json_text:
    # Tokenize each tweet and add its tokens to the running list
    token_words_list.extend(word_tokenize(tweet))

In [50]:
len(token_words_list)

23997
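
The loop above can also be written as a single flattening step. The sketch below is only an illustration, not the notebook's code: `tokenize_all` is a hypothetical helper, and `str.split` stands in for `word_tokenize` so that the example runs without NLTK's tokenizer data. Whitespace splitting leaves punctuation attached to words, so its token counts will differ from NLTK's.

```python
from itertools import chain


def tokenize_all(tweets, tokenizer=str.split):
    """Tokenize every tweet and flatten the results into one list.

    `tokenizer` is any callable that maps a string to a list of tokens.
    Passing nltk.tokenize.word_tokenize should reproduce the loop above.
    """
    return list(chain.from_iterable(tokenizer(tweet) for tweet in tweets))


# Two short made-up inputs, loosely based on the tweets shown earlier
sample_tweets = ["technology to steal signs", "think of the possibilities"]
sample_tokens = tokenize_all(sample_tweets)
print(sample_tokens)
```

Keeping the tokenizer as a parameter makes it easy to swap in a different one later without touching the flattening logic.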

# Clean words

Before the tokens can be analyzed, they need to be cleaned. Punctuation such as @, #, and ! has to be removed, and so do stopwords: common words such as "we", "you", "and", and "they" that carry little meaning on their own.

## Remove punctuation

In [67]:
token_punc_removed = list()

In [68]:
for word in token_words_list:
    # Keep purely alphabetic tokens only; this drops punctuation
    # and also any token that contains digits or symbols
    if word.isalpha():
        # Lowercase the word before appending it
        token_punc_removed.append(word.lower())

In [69]:
len(token_punc_removed)

16843
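
It helps to see exactly what the `isalpha()` filter keeps and drops. `str.isalpha()` returns `True` only when every character in the string is a letter, so it removes more than punctuation. The token list below is made up for illustration (it imitates the kind of tokens a tweet might produce) and is not taken from the real data.

```python
# Made-up tokens for illustration; not the notebook's actual token list
sample_tokens = ["RT", "@", "SBSNews", ":", "The", "'d", "tv/computer", "5G", "technology"]

# Same rule as the loop above: keep alphabetic tokens, lowercased
cleaned = [token.lower() for token in sample_tokens if token.isalpha()]

# Everything the filter throws away
dropped = [token for token in sample_tokens if not token.isalpha()]

print(cleaned)
print(dropped)
```

Tokens such as `5G` or `tv/computer` are discarded entirely rather than cleaned, which is worth keeping in mind when reading the word counts later.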

## Remove Stopwords

In [65]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/MBAN/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [77]:
from nltk.corpus import stopwords

# A set gives fast membership tests in the filtering loop below
stop_words = set(stopwords.words('english'))

In [78]:
token_stopwords_removed = list()

for word in token_punc_removed:
    if word not in stop_words:
        token_stopwords_removed.append(word)

In [80]:
print("After Stopwords were removed, there are : {} remaining".format(len(token_stopwords_removed)))

After Stopwords were removed, there are : 10941 remaining
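
The same filtering step can be sketched as a list comprehension. To keep the example self-contained, it uses a tiny hypothetical stopword set (`example_stop_words`) instead of NLTK's English list, which is much longer, so the result here is only indicative.

```python
# Hypothetical, tiny stopword set for illustration only;
# the notebook uses NLTK's full English stopword list instead
example_stop_words = {"we", "you", "and", "they", "the", "to", "of"}

# Made-up, already-cleaned (lowercase, alphabetic) tokens
cleaned_tokens = ["we", "have", "the", "technology", "to", "show", "them"]

# Keep only the tokens that are not stopwords
kept = [token for token in cleaned_tokens if token not in example_stop_words]
print(kept)
```

Storing the stopwords in a `set`, as the notebook does, keeps each membership check fast even for a long token list.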


# Analyze words

## Count word occurrences

In [83]:
from collections import Counter

In [87]:
word_counts = Counter(token_stopwords_removed)
type(word_counts)

collections.Counter

In [102]:
import pandas as pd
word_counts_df = pd.DataFrame.from_dict(word_counts, 
                                        orient='index', 
                                        columns=['count'])
word_counts_df.head()

| | count |
|---|---|
| rt | 626 |
| sbsnews | 2 |
| morrison | 3 |
| government | 10 |
| set | 4 |


## Find 10 most popular words

In [105]:
word_counts_df.sort_values(by='count', ascending=False).head(10)

| | count |
|---|---|
| rt | 626 |
| https | 456 |
| technology | 276 |
| put | 90 |
| farmer | 83 |
| intelligence | 80 |
| bloomberg | 77 |
| anyone | 74 |
| farmers | 71 |
| data | 64 |
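
If only the top entries are needed, `Counter.most_common` gives the same ranking without building a DataFrame. The sketch below uses a short made-up token list rather than the real tweets.

```python
from collections import Counter

# Made-up tokens for illustration; not the notebook's data
example_tokens = ["rt", "technology", "rt", "farmer", "technology", "rt"]
example_counts = Counter(example_tokens)

# most_common(n) returns the n most frequent items as (word, count) pairs,
# ordered from most to least frequent
top_two = example_counts.most_common(2)
print(top_two)
```

The DataFrame route is still handy when the counts need to be filtered, joined, or plotted afterwards.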


## We can see above that 'farmer' and 'farmers' are counted as two different words. Let's lemmatize!

Resource: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

In [111]:
from nltk.stem import WordNetLemmatizer

# The lemmatizer needs the WordNet corpus
nltk.download('wordnet')
wnl = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /Users/MBAN/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [112]:
wnl.lemmatize('farmers')

'farmer'

In [119]:
word_lemmatized_list = list()
for word in token_stopwords_removed:
    w = wnl.lemmatize(word)
    word_lemmatized_list.append(w)

In [121]:
word_counts = Counter(word_lemmatized_list)

word_counts_df = pd.DataFrame.from_dict(word_counts, 
                                        orient='index', 
                                        columns=['count'])

In [122]:
word_counts_df.sort_values(by='count', ascending=False).head(10)

| | count |
|---|---|
| rt | 626 |
| http | 456 |
| technology | 282 |
| farmer | 154 |
| put | 90 |
| intelligence | 80 |
| bloomberg | 77 |
| anyone | 74 |
| teach | 66 |
| data | 64 |
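
Lemmatizing before counting has the same effect as merging the counts of all word forms that share a lemma. The sketch below checks this for 'farmer' using the two counts from the first top-10 table (83 and 71). The `example_lemmas` dictionary is a hypothetical stand-in for `wnl.lemmatize`, used only so the example runs without the WordNet corpus.

```python
from collections import Counter

# Counts for the two forms, taken from the first top-10 table
counts_before = Counter({"farmer": 83, "farmers": 71})

# Hypothetical stand-in for wnl.lemmatize: maps a word form to its lemma
example_lemmas = {"farmers": "farmer"}

# Add each form's count to the count of its lemma
counts_after = Counter()
for word, count in counts_before.items():
    lemma = example_lemmas.get(word, word)  # unknown words map to themselves
    counts_after[lemma] += count

print(counts_after["farmer"])  # 83 + 71 = 154
```

The total of 154 matches the 'farmer' row in the lemmatized table.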
