# 01 - Demo Exercises
Let's go through some demo exercises to introduce you to Social Media Analytics using Twitter data, and learn some Python along the way.

## Load Data
Let's use the `pandas` library to load a `.json` file into a `pandas` `DataFrame` containing the text data from a sample of [Twitter](https://about.twitter.com/en) tweets. We will also print the `head()` so we can see what the object looks like. The cells that follow read the tweet text from the `full_text` column.

In [None]:
import pandas as pd
df_WedMot = pd.read_json('../data/WednesdayMotivation_tweets_text.json')
df_WedMot.head()

## Describe Data

### Identify Retweets
Let's identify which tweets in the dataset are [Retweets](https://help.twitter.com/en/using-twitter/retweet-faqs) (RT). We do this using the [`apply()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) method with the [`lambda`](https://www.w3schools.com/python/python_lambda.asp) notation to define an inline function. Run the code below.

In [None]:
df_WedMot['isRT'] = df_WedMot['full_text'].apply(lambda s: s.startswith('RT'))
df_WedMot.head()
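
As a small aside, the same pattern can be tried on a toy table before running it on the real data. The sketch below assumes a hypothetical three-row DataFrame called `toy` (not part of the course data). It compares `apply()` with a `lambda` against the vectorised `.str.startswith()` accessor, which should give the same result here.

```python
import pandas as pd

# Hypothetical toy data, standing in for the real tweets
toy = pd.DataFrame({'full_text': ['RT @user: hello', 'Good morning!', 'RT @other: hi']})

# apply() calls the lambda once per value in the column
toy['isRT'] = toy['full_text'].apply(lambda s: s.startswith('RT'))

# The .str accessor performs the same check without an explicit lambda
toy['isRT_str'] = toy['full_text'].str.startswith('RT')

print(toy)
```

Either form works for this exercise; the `lambda` version is more general because the inline function can contain any Python expression.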

### Count Retweets
Now let's count how many tweets are Retweets by simply counting the number of `True` values in the `isRT` column. Run the code below.

In [None]:
numRT = df_WedMot['isRT'].sum()
print(f'Number of Retweets = {numRT}')

Note how we use a formatted string ([f-string](https://docs.python.org/3/tutorial/inputoutput.html)) to print the result.
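
Summing a column of `True`/`False` values works because Python treats `True` as 1 and `False` as 0. The minimal sketch below illustrates this with a hypothetical list called `flags` (standing in for the `isRT` column), and also shows how a format specifier inside an f-string can control how a number is displayed.

```python
# Hypothetical flags, standing in for the isRT column
flags = [True, False, True, True]

# True counts as 1 and False as 0, so sum() gives the number of True values
num_true = sum(flags)
share = num_true / len(flags)

# A format specifier after the colon controls how the value is displayed
message = f'Number of True values = {num_true} ({share:.1%} of {len(flags)})'
print(message)
```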

### Identify mentions
Now let's identify which tweets contain at least one [Mention](https://help.twitter.com/en/using-twitter/types-of-tweets).

In [None]:
df_WedMot['has@'] = df_WedMot['full_text'].apply(lambda s: '@' in s)
df_WedMot.head()

### Task 01-01: Count tweets containing at least one mention
Use the methods we used above to count the number of tweets containing at least one mention.

In [None]:
# (SOLUTION)


## Predict Sentiment
Let's now perform some further analysis of the tweet text data and try to predict the sentiment of each tweet.

### Set up our Sentiment Predictor
Run the code below to download the [VADER](https://ojs.aaai.org/index.php/ICWSM/article/view/14550) [lexicon](https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt), create an instance of an [NLTK](https://www.nltk.org/) [SentimentIntensityAnalyzer](https://www.nltk.org/_modules/nltk/sentiment/vader.html#SentimentIntensityAnalyzer), and build a `predict_sentiment` function.

In [None]:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
scorer = SentimentIntensityAnalyzer()

def predict_sentiment(text_string):
    # The 'compound' score is VADER's overall sentiment, normalised to lie between -1 and 1
    return scorer.polarity_scores(text_string)['compound']

### Apply our Sentiment Predictor
Run the code below to apply our sentiment predictor to the text in each row.

In [None]:
df_WedMot['sentiment'] = df_WedMot['full_text'].apply(predict_sentiment)
df_WedMot.head()

## Explore Predictions
Now that we have predicted the sentiment of each tweet, let's explore the predicted values.

### Visualise Predictions
Firstly, we can plot the predicted values to better understand the distribution of the (predicted) outcome variable for the sample of tweets. Run the code below.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
bins = np.linspace(-1, 1, 20)
plt.hist(df_WedMot['sentiment'], bins, alpha=0.5, label='WedMot')
plt.legend(loc='upper right')
plt.show()
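
To see what the `bins` argument does, it may help to look at the numbers behind a histogram. The sketch below uses a hypothetical array called `scores` (not the real predictions) with `np.histogram`, which computes per-bin counts without drawing anything. Note that `np.linspace(-1, 1, 20)` produces 20 bin edges, which define 19 bins.

```python
import numpy as np

# Same bin edges as in the plot above: 20 edges define 19 bins
bins = np.linspace(-1, 1, 20)

# Hypothetical sentiment scores, standing in for df_WedMot['sentiment']
scores = np.array([-0.8, -0.1, 0.0, 0.0, 0.3, 0.95])

# np.histogram returns the per-bin counts and the bin edges
counts, edges = np.histogram(scores, bins=bins)

print(len(edges), len(counts))
print(counts.sum())
```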

### Describe Predictions
We can also analyse the predicted values, in the same way we would analyse observed values. For example, run the code below to calculate the mean sentiment over all the tweets.

In [None]:
mean_sentiment = df_WedMot['sentiment'].mean()
print(f'Mean sentiment = {mean_sentiment}')

### Categorise tweets
We can also use the predicted values to categorise the tweets using simple thresholding.

In [None]:
df_WedMot['class'] = df_WedMot['sentiment'].apply(lambda s: -1 if s < 0 else 1)
df_WedMot.head()

*Question* : Do you think this threshold is a good way to categorise the tweets?
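
When thinking about this question, it can help to compare against an alternative. The sketch below is not the method used in this notebook, which continues with the two classes defined above. It adds a neutral band using cut-offs of ±0.05, which the VADER authors suggest as typical thresholds for the compound score. The function name `categorise` and the sample scores are hypothetical.

```python
def categorise(score, cutoff=0.05):
    """Map a compound score to -1 (negative), 0 (neutral) or 1 (positive)."""
    if score >= cutoff:
        return 1
    if score <= -cutoff:
        return -1
    return 0

# Hypothetical compound scores for illustration
sample_scores = [-0.6, -0.02, 0.0, 0.04, 0.7]
classes = [categorise(s) for s in sample_scores]
print(classes)
```

A function like this could be passed to `apply()` on the `sentiment` column in the same way as the `lambda` above.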

### Task 01-02: Count group sizes
Count the number of tweets which fall into the two classes we defined above.

In [None]:
#(SOLUTION)


*Question* : Do you still think it is a good way (or not) to categorise the tweets?

### Inspect most positive tweets
Let's have a quick look at the 3 tweets with the most positive sentiment.

In [None]:
def print_bold(text):
    print('\033[1m' + text + '\033[0m')

print_bold("\n Tweets with most positive sentiment")
for full_text in df_WedMot.sort_values(by=['sentiment'], ascending=False)['full_text'].head(3):
    print_bold("\n Tweet:")
    print(full_text)
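
The sort-then-`head()` pattern above can be checked on a small table. The sketch below assumes a hypothetical DataFrame called `toy` with made-up scores, and also shows the `nlargest()` method, which selects the same rows in one step.

```python
import pandas as pd

# Hypothetical toy data, standing in for df_WedMot
toy = pd.DataFrame({
    'full_text': ['meh', 'great day', 'awful', 'love it', 'fine'],
    'sentiment': [0.0, 0.6, -0.7, 0.8, 0.2],
})

# Sort by sentiment (highest first) and keep the first 3 rows
top_sorted = toy.sort_values(by='sentiment', ascending=False).head(3)

# nlargest() selects the same rows in one step
top_nlargest = toy.nlargest(3, 'sentiment')

print(top_nlargest['full_text'].tolist())
```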

### <ins>Advanced</ins> Task 01-03: Organise tweets by predicted sentiment and reflect
<span style="color: red;">**You do not need to do this task, but you can if you want!**</span>

Use what you have learned before (in other units) to identify:
* the tweets with most positive sentiment
* the tweets with most negative sentiment
* the tweets with zero sentiment

*Question* : Do the results look like you might expect?

*Question* : How do you think this sample of tweets was collected?

In [None]:
#(SOLUTION)
