# Social Media Analytics
Social media analytics is the practice of gathering data from social media websites and analyzing that data using social media analytics tools to make business decisions. The most common use of social media analytics is to mine customer sentiment to support marketing and customer service activities. <br>

The first step in a social media intelligence initiative is to determine which business goals the data that is gathered and analyzed will benefit. Typical objectives include increasing revenues, reducing customer service costs, getting feedback on products and services, and improving public opinion of a particular product or business division. <br>

Once the business goals have been identified, businesses should define key performance indicators (KPIs) to objectively evaluate the business analytics data. <br>

[ref](https://searchbusinessanalytics.techtarget.com/definition/social-media-analytics)

### About this Dataset

This data originally came from Crowdflower's Data for Everyone library. As the original source says:

> A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as "late flight" or "rude service").

## Table of Contents
### 1. Preprocessing
### 2. Exploratory data analysis (EDA)
### 3. Conclusion

---

## 0. Load Data & Libraries

First, we import the libraries we need and load the data.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Math and dataframe modules
import numpy as np
import pandas as pd

# Text-processing modules
import re, string, unicodedata
import wordcloud as wc
import textblob
from bs4 import BeautifulSoup
from tqdm import tqdm
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from textblob import TextBlob
from textblob import Word

# Visualization modules
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')
from wordcloud import WordCloud, STOPWORDS

In [None]:
tweet = pd.read_csv('/kaggle/input/twitter-airline-sentiment/Tweets.csv')
tweet.head()

---

## 1. Preprocessing

Data-gathering methods are often loosely controlled, resulting in out-of-range values, impossible data combinations, missing values, etc. Analyzing data that has not been carefully screened for such problems can produce misleading results.

[ref](https://en.wikipedia.org/wiki/Data_pre-processing)
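
As a rough illustration of that kind of screening, the sketch below counts missing values per column and flags labels outside an expected set. It uses plain Python on a few made-up rows; the rows and the `EXPECTED_SENTIMENTS` set are assumptions for illustration, not values taken from the dataset.

```python
from collections import Counter

# Hypothetical sample rows standing in for the real CSV
rows = [
    {"airline": "United", "airline_sentiment": "negative", "negativereason": "Late Flight"},
    {"airline": "Delta", "airline_sentiment": "positive", "negativereason": None},
    {"airline": None, "airline_sentiment": "angry", "negativereason": None},
]

# Assumed set of valid sentiment labels
EXPECTED_SENTIMENTS = {"positive", "neutral", "negative"}

# Count missing values per column
missing = Counter(col for row in rows for col, value in row.items() if value is None)

# Collect rows whose sentiment label is outside the expected set
unexpected = [row for row in rows if row["airline_sentiment"] not in EXPECTED_SENTIMENTS]

print(dict(missing))
print(len(unexpected))
```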

In [None]:
#copy dataframe to tweet2
tweet2 = tweet.copy()

In [None]:
#drop identifier, timestamp and location columns that the analysis does not use
tweet_drop = tweet2.drop(columns=['tweet_id','tweet_created','tweet_location','user_timezone','tweet_coord'])

In [None]:
tweet_drop.shape

In [None]:
#drop every tweet whose text occurs more than once (keep=False removes all copies, not just the repeats)
tweet_drop.drop_duplicates(subset ="text",keep = False, inplace = True)
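
With `keep=False`, pandas drops every row whose `text` occurs more than once instead of keeping one copy. A small stand-alone sketch of the difference, using made-up tweets and plain Python rather than pandas:

```python
from collections import Counter

# Hypothetical tweets; the first and third are identical
texts = ["flight delayed", "great crew", "flight delayed", "lost my bag"]

counts = Counter(texts)

# Same effect as keep=False: keep only texts that occur exactly once
keep_none = [t for t in texts if counts[t] == 1]

# Same effect as keep='first': keep the first occurrence of each text
keep_first = list(dict.fromkeys(texts))

print(keep_none)
print(keep_first)
```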

In [None]:
#check data variables
tweet_drop.info()

In [None]:
#percentage of non-null values per column (0-100)
tweet_drop.count().sort_values(ascending = False) / len(tweet_drop)*100

In [None]:
tweet_drop['text'].head()

### 1.1 Multiple Step Preprocessing

To build the word clouds, the tweet text first has to go through several preprocessing steps. <br>
[preprocessing ref](https://link.springer.com/chapter/10.1007/978-3-319-67008-9_31)

In [None]:
#cleansing
def preprocessing(text):

    def removeUnicode(text):
        # Remove escaped unicode sequences such as "\u002c", then any non-ASCII character
        text = re.sub(r'(\\u[0-9A-Fa-f]+)', '', text)
        text = re.sub(r'[^\x00-\x7f]', '', text)
        return text

    def replaceURL(text):
        # Remove URLs, then remove hashtags entirely (the "#" and the word after it)
        text = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', '', text)
        text = re.sub(r'#([^\s]+)', '', text)
        return text

    def replaceAtUser(text):
        # Remove "@user" mentions
        text = re.sub(r'@[^\s]+', '', text)
        return text

    def removeHashtagInFrontOfWord(text):
        # Strip a leading "#" but keep the word.
        # replaceURL has already removed whole hashtags, so this has no effect in this pipeline.
        text = re.sub(r'#([^\s]+)', r'\1', text)
        return text

    def removeNumbers(text):
        # Remove every digit
        text = ''.join([i for i in text if not i.isdigit()])
        return text

    def replaceMultiExclamationMark(text):
        # Remove runs of two or more "!"
        text = re.sub(r"(\!)\1+", '', text)
        return text

    def replaceMultiQuestionMark(text):
        # Remove runs of two or more "?"
        text = re.sub(r"(\?)\1+", '', text)
        return text

    def replaceMultiStopMark(text):
        # Remove runs of two or more "."
        text = re.sub(r"(\.)\1+", '', text)
        return text

    def removeEmoticons(text):
        # ":-\*", ":S" and ":-S" were previously written ":-*", ":\S" and ":-\S";
        # the unescaped "*" made the pattern match every colon.
        text = re.sub(r":\)|;\)|:-\)|\(-:|:-D|=D|:P|xD|X-p|\^\^|:-\*|\^\.\^|\^-\^|\^_\^|,-\)|\)-:|:'\(|:\(|:-\(|:S|T\.T|\._\.|:<|:-S|:-<|\*-\*|:O|=O|=-O|O\.o|XO|O_O|:-@|=/|:/|X-\(|>\.<|>=\(|D:", '', text)
        return text

    contraction_patterns = [
        (r"won't", 'will not'), (r"can't", 'cannot'), (r"i'm", 'i am'), (r"ain't", 'is not'),
        (r"(\w+)'ll", r'\g<1> will'), (r"(\w+)n't", r'\g<1> not'), (r"(\w+)'ve", r'\g<1> have'),
        (r"(\w+)'s", r'\g<1> is'), (r"(\w+)'re", r'\g<1> are'), (r"(\w+)'d", r'\g<1> would'),
        (r'&amp', ''), (r'dammit', 'damn it'), (r'dont', 'do not'), (r'wont', 'will not'),
    ]

    def replaceContraction(text):
        # Apply each contraction rule in order
        for (regex, repl) in contraction_patterns:
            text = re.sub(regex, repl, text)
        return text

    def replaceElongated(word):
        # Collapse doubled letters one pass at a time until the word is found in WordNet
        # or nothing changes. Note: it is called below on the whole tweet, not word by word,
        # so the WordNet check rarely applies and doubled letters are collapsed in every
        # word (for example "still" becomes "stil").
        repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        repl = r'\1\2\3'
        if wordnet.synsets(word):
            return word
        repl_word = repeat_regexp.sub(repl, word)
        if repl_word != word:
            return replaceElongated(repl_word)
        else:
            return repl_word

    # Removes unicode strings like "\u002c" and "x96"
    text = removeUnicode(text)
    # Removes URLs and hashtags
    text = replaceURL(text)
    # Removes "@user" mentions
    text = replaceAtUser(text)
    # Removes the hashtag sign in front of a word
    text = removeHashtagInFrontOfWord(text)
    # Removes digits
    text = removeNumbers(text)
    # Removes repetitions of exclamation marks
    text = replaceMultiExclamationMark(text)
    # Removes repetitions of question marks
    text = replaceMultiQuestionMark(text)
    # Removes repetitions of stop marks
    text = replaceMultiStopMark(text)
    # Removes emoticons from text
    text = removeEmoticons(text)
    # Replaces contractions with their expanded equivalents
    text = replaceContraction(text)
    # Collapses repeated letters (see the note in replaceElongated)
    text = replaceElongated(text)

    return text.lower()
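
`replaceElongated` checks a single word against WordNet, so the rule is naturally applied token by token. A minimal per-word sketch follows. To stay self-contained it swaps WordNet for a tiny hypothetical `LEXICON` set, so the function names and the word list here are assumptions, not part of the notebook.

```python
import re

# Hypothetical stand-in for the WordNet lookup
LEXICON = {"still", "cool", "so", "good", "flight"}

REPEAT = re.compile(r'(\w*)(\w)\2(\w*)')

def replace_elongated(word):
    # Stop as soon as the word is a known word
    if word in LEXICON:
        return word
    shorter = REPEAT.sub(r'\1\2\3', word)
    # Collapse doubled letters and repeat until nothing changes
    return replace_elongated(shorter) if shorter != word else shorter

def normalize(text):
    # Apply the rule word by word instead of to the whole string
    return " ".join(replace_elongated(w) for w in text.split())

print(normalize("still sooo coool flight"))
```

Because "still" is looked up on its own, it is left intact, while "sooo" and "coool" are shortened until they match a known word.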

In [None]:
tweet_drop['text'] = tweet_drop['text'].apply(preprocessing)
tweet_drop['text'].head()
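
To see what a few of these steps do, here is a reduced, stand-alone sketch that applies only the URL, hashtag, mention and digit rules to a made-up tweet. The function name `clean_tweet` and the sample text are hypothetical, and the whitespace squeeze at the end is an extra step that the notebook's function does not perform.

```python
import re

def clean_tweet(text):
    text = re.sub(r'(www\.\S+)|(https?://\S+)', '', text)  # drop URLs
    text = re.sub(r'#\S+', '', text)                        # drop hashtags entirely
    text = re.sub(r'@\S+', '', text)                        # drop @mentions
    text = ''.join(ch for ch in text if not ch.isdigit())   # drop digits
    return ' '.join(text.split()).lower()                   # squeeze whitespace, lowercase

sample = "@united Flight 123 delayed AGAIN #fail http://t.co/abc"
print(clean_tweet(sample))
```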

### 1.2 Punctuation removal

Remove punctuation from the tweets.

In [None]:
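#regex=True is required in pandas 2.0+, where str.replace treats the pattern as a literal string by default
tweet_drop['text'] = tweet_drop['text'].str.replace(r'[^\w\s]', '', regex=True)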
tweet_drop['text'].head()
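
The pattern `[^\w\s]` matches any character that is neither a word character nor whitespace. A quick stand-alone check with the `re` module on a made-up string shows the effect. Note that `\w` includes the underscore, so underscores survive.

```python
import re

sample = "thanks!!! great_service, really... #1"
cleaned = re.sub(r'[^\w\s]', '', sample)
print(cleaned)
```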

### 1.3 Stopword Removal

Remove stop words from the tweets.

In [None]:
#a set makes each membership test constant time
stop = set(stopwords.words('english'))
tweet_drop['text'] = tweet_drop['text'].apply(lambda x: " ".join(word for word in x.split() if word not in stop))
tweet_drop['text'].head()
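
The filter is a plain membership test per token. The sketch below shows the same idea with a small hypothetical stop list (`STOP`) in place of NLTK's. NLTK's English list also contains negations such as "not", so filtering can change the apparent polarity of a phrase, as the example shows.

```python
# Hypothetical stop list; the notebook uses nltk's stopwords.words('english')
STOP = {"the", "was", "not", "a", "is", "my"}

def remove_stopwords(text, stop=STOP):
    # Keep only the tokens that are not in the stop list
    return " ".join(word for word in text.split() if word not in stop)

print(remove_stopwords("the flight was not late"))
```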

### 1.4 Remove Frequent Word

Remove the three most frequent words from the tweets.

In [None]:
freq = pd.Series(' '.join(tweet_drop['text']).split()).value_counts()[:3]
freq

In [None]:
freq = list(freq.index)
tweet_drop['text'] = tweet_drop['text'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
tweet_drop['text'].head()
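
The same counting can be done with `collections.Counter`. A small sketch on made-up, already-cleaned tweets follows; the sample data and the cut-off of two words are assumptions.

```python
from collections import Counter

# Hypothetical cleaned tweets
tweets = ["flight late flight", "flight cancelled late", "late bag lost", "bag flight"]

# Count every word across all tweets and take the two most frequent
counts = Counter(word for t in tweets for word in t.split())
top = {word for word, _ in counts.most_common(2)}

# Drop those words from each tweet
filtered = [" ".join(w for w in t.split() if w not in top) for t in tweets]

print(sorted(top))
print(filtered)
```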

### 1.5 Rare Word Removal

Remove the 5,000 least frequent words from the tweets.

In [None]:
rare_word = pd.Series(' '.join(tweet_drop['text']).split()).value_counts()[-5000:]
rare_word

In [None]:
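#filter on rare_word (the previous version filtered on freq again, so no rare word was removed)
rare_word = set(rare_word.index)
tweet_drop['text'] = tweet_drop['text'].apply(lambda x: " ".join(x for x in x.split() if x not in rare_word))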
tweet_drop['text'].head()
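
Taking the last 5,000 entries of `value_counts()` cuts at a fixed position, so words with the same count can land on either side of the cut. One alternative, shown here only as a sketch and not what this notebook does, is to drop words below a minimum count. `MIN_COUNT` and the sample tweets are assumptions.

```python
from collections import Counter

MIN_COUNT = 2  # assumed threshold

# Hypothetical cleaned tweets
tweets = ["delay gate delay", "gate agent rude", "delay refund"]

counts = Counter(word for t in tweets for word in t.split())

# Words seen fewer than MIN_COUNT times are treated as rare
rare = {word for word, c in counts.items() if c < MIN_COUNT}

filtered = [" ".join(w for w in t.split() if w not in rare) for t in tweets]

print(sorted(rare))
print(filtered)
```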

### 1.6 Correction Word

Correct misspelled words with TextBlob. Running the correction on the whole dataset takes a long time, so the cell below only processes the first 10 tweets and displays the result without writing it back. To correct every tweet, remove the `[:10]` slice.

In [None]:
tweet_drop['text'][:10].apply(lambda x: str(TextBlob(x).correct()))

### 1.7 Stemming & Lemmatization

I chose not to apply stemming or lemmatization because the resulting word forms are harder to read in the word cloud. To use either one, uncomment the lines below.

In [None]:
#Stemming

# st = PorterStemmer()
# tweet_drop['text'] = tweet_drop['text'][:].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
# tweet_drop['text'].head()

In [None]:
#Lemmatization

# tweet_drop['text'] = tweet_drop['text'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
# tweet_drop['text'].head()

## 2. Exploratory data analysis (EDA)

In this part, we explore the dataset and analyze what happened during the collection period.

In [None]:
df = tweet_drop.copy()
df.head()

In [None]:
tweet['tweet_created'] = tweet.tweet_created.str[:10]
date = tweet['tweet_created'].unique().tolist()
date

The tweets were collected from 16 February 2015 to 24 February 2015.
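
The range can also be read off programmatically by parsing the date strings and taking the minimum and maximum. A sketch with made-up values, assuming the sliced strings have the form `2015-02-24`:

```python
from datetime import date

# Hypothetical sample of unique date strings
dates = ["2015-02-24", "2015-02-16", "2015-02-20"]

# Parse the ISO strings so they compare as dates, not as text
parsed = [date.fromisoformat(d) for d in dates]
first, last = min(parsed), max(parsed)

# Number of calendar days covered, both ends included
span_days = (last - first).days + 1

print(first, last, span_days)
```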

Next, we look at the sentiment of the tweets.

In [None]:
#sentiment distribution
plt.figure(figsize=(8,6))
ax = sns.countplot(data=df, x='airline_sentiment', order=df['airline_sentiment'].value_counts().index)
ax.set_title('Count Plot', fontsize=20)
ax.set_xlabel('airline_sentiment', fontsize=15)
ax.tick_params(labelsize=12)

#number of tweets per airline
plt.figure(figsize=(8,6))
sns.countplot(x=df["airline"])
plt.title("Airlines Distribution")

The count plot shows that most tweets in the dataset carry negative sentiment. As for the airlines, United has the most tweets and Virgin America the fewest. <br> <br>
The question is: does each airline also receive mostly negative tweets?

In [None]:
def plot_sub_sentiment(Airline):
    data=df[df['airline']==Airline]
    count=data['airline_sentiment'].value_counts().index
    ax=sns.countplot(data=data, x='airline_sentiment',order = count)
    plt.title('Count Plot '+Airline,fontsize = 15)
    plt.ylabel('Sentiment Count')
    plt.xlabel('Mood')
    
plt.figure(1,figsize=(15, 15))
plt.subplots_adjust(left=0.125, bottom=0.1, right=0.9, top=0.9, wspace=0.2, hspace=0.2)
plt.subplot(231)
plot_sub_sentiment('US Airways')
plt.subplot(232)
plot_sub_sentiment('United')
plt.subplot(233)
plot_sub_sentiment('American')
plt.subplot(234)
plot_sub_sentiment('Southwest')
plt.subplot(235)
plot_sub_sentiment('Delta')
plt.subplot(236)
plot_sub_sentiment('Virgin America')

From 16 February to 24 February, US Airways, American Airlines and United Airlines received the most negative sentiment from the public. The other airlines received negative tweets too, but not as many as these three.
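
Raw counts favor airlines that simply have more tweets, so a negative share per airline is a useful complement to the count plots. A minimal sketch with made-up (airline, sentiment) pairs; the data is hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (airline, sentiment) pairs
records = [
    ("United", "negative"), ("United", "negative"), ("United", "negative"), ("United", "positive"),
    ("Delta", "negative"), ("Delta", "positive"),
]

# Count sentiments separately for each airline
by_airline = defaultdict(Counter)
for airline, sentiment in records:
    by_airline[airline][sentiment] += 1

# Fraction of negative tweets per airline
negative_share = {
    airline: counts["negative"] / sum(counts.values())
    for airline, counts in by_airline.items()
}

print(negative_share)
```

On the notebook's dataframe, `pd.crosstab(df['airline'], df['airline_sentiment'], normalize='index')` should give the equivalent table of shares.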

Let's see what caused the negative sentiment for these three airlines (US Airways, American Airlines and United Airlines).

In [None]:
df.head()

In [None]:
#keep only the negative tweets about the three airlines
three_airlines = df[df['airline'].isin(['US Airways', 'United', 'American'])]
three_airlines = three_airlines[three_airlines['airline_sentiment'] == 'negative']

In [None]:
three_airlines.head()

In [None]:
plt.figure(figsize=(22, 5))
ax = sns.countplot(x="negativereason", hue="airline", data=three_airlines)
plt.xticks(rotation=15)
plt.ylabel('Sentiment Count')
plt.xlabel('Negative Reason')

From 16 February to 24 February, we can see:
> For United Airlines and US Airways, the main problems are Customer Service and Late Flight. <br>

> For American Airlines, the main problems are Customer Service and Cancelled Flight. <br>
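
The same reading can be checked numerically by counting reasons per airline and taking the two most common. A sketch with hypothetical (airline, reason) pairs, not figures from the dataset:

```python
from collections import Counter, defaultdict

# Hypothetical (airline, negative reason) pairs
pairs = [
    ("United", "Customer Service Issue"), ("United", "Late Flight"),
    ("United", "Customer Service Issue"), ("United", "Lost Luggage"),
    ("United", "Late Flight"), ("United", "Customer Service Issue"),
    ("American", "Cancelled Flight"), ("American", "Customer Service Issue"),
    ("American", "Customer Service Issue"), ("American", "Cancelled Flight"),
    ("American", "Customer Service Issue"), ("American", "Late Flight"),
]

# Count reasons separately for each airline
reasons = defaultdict(Counter)
for airline, reason in pairs:
    reasons[airline][reason] += 1

# Two most frequent reasons per airline
top_two = {a: [r for r, _ in c.most_common(2)] for a, c in reasons.items()}

print(top_two)
```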

### Wordcloud

In [None]:
#split the tweets by sentiment: positive (happy), neutral and negative (not happy)
sentiment_positive = df.loc[df['airline_sentiment'] == "positive"]
sentiment_neutral  = df.loc[df['airline_sentiment'] == "neutral"]
sentiment_negative = df.loc[df['airline_sentiment'] == "negative"]

In [None]:
sentiment_positive.head()

In [None]:
#merge the tweets of each sentiment into one string, separated by spaces so that words from adjacent tweets do not fuse
all_description_happy = " ".join(sentiment_positive['text'].values)
all_description_neutral = " ".join(sentiment_neutral['text'].values)
all_description_not_happy = " ".join(sentiment_negative['text'].values)
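
When tweets are merged into one string, the separator matters: without a space, the last word of one tweet fuses with the first word of the next and is counted as a new token. A short illustration with made-up tweets:

```python
# Hypothetical cleaned tweets
tweets = ["great flight", "thank you"]

fused = "".join(tweets)    # no separator
spaced = " ".join(tweets)  # single space between tweets

print(fused.split())
print(spaced.split())
```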

In [None]:
def create_word_cloud(string):
    plt.figure(1,figsize=(10, 10))
    cloud = WordCloud(background_color = "white",width=1000,
                      height=500, max_words = 150, stopwords = set(STOPWORDS)).generate(string)
    plt.imshow(cloud, interpolation='bilinear')
    
    plt.axis('off')
    plt.show()

### happy wordcloud

In [None]:

create_word_cloud(all_description_happy)

As we can see, 'thank' is the largest word in the word cloud, which means it appears most often in the positive tweets. 'great' and 'time' are the next two. 'time' most likely refers to flights being on time, and 'great' to a good overall experience.
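
Word size in the cloud tracks frequency, so this reading can be cross-checked by counting words directly. A sketch using `collections.Counter` on a made-up string; in the notebook the input would be `all_description_happy`.

```python
from collections import Counter

# Hypothetical merged text
text = "thank great time thank great flight thank time great thank crew"

# Three most frequent words with their counts
top_words = Counter(text.split()).most_common(3)

print(top_words)
```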

### neutral wordcloud

In [None]:
#neutral
create_word_cloud(all_description_neutral)

In the neutral word cloud, 'flight' is the largest word. Small words dominate the rest of the cloud, which in my opinion is because so many different words occur at similar frequencies. Another interesting point is that 'time' and 'help' are among the top six words for the neutral sentiment.

### not happy wordcloud

In [None]:
#not happy 
create_word_cloud(all_description_not_happy)

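'one' and 'hour' point to flight delays, and they are followed by 'time', so the main problem appears to be timing. 'stil' is most likely 'still' with its doubled letter collapsed by the elongated-word step in preprocessing, and can be read as 'still at the airport'.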

---

### Conclusion
> 1. Tweets about the six airlines from 16 February to 24 February are predominantly negative.

> 2. The three airlines with the most negative sentiment (US Airways, American Airlines and United Airlines) share the same problem: Customer Service Issue.

> 3. Positive sentiment can be represented by the word 'thank', negative by 'hour' and neutral by 'flights'.

***.avo***