## ▷▷▷▷▷ 💡 Introduction 💡 ◁◁◁◁◁

1. Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

2. Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

## ▷▷▷▷▷ 💡 Recap 💡 ◁◁◁◁◁

In data visualization, pandas and NumPy are commonly used to load and understand the data we are working with.

In [None]:
media_df = pd.read_csv('./social_media.csv')

In [None]:
media_df.head()

In [None]:
media_df.tail()

In [None]:
media_df.shape

In [None]:
media_df.columns

In [None]:
# Converting the names of all the columns to lowercase
media_df.columns = [col.lower() for col in media_df]
media_df.columns

In [None]:
# Extracting data for facebook, twitter, instagram, youtube
pops_col = media_df[['facebook', 'twitter', 'instagram', 'youtube']]
pops_col.head()

In [None]:
# Slicing by rows
pops_subset = media_df.iloc[1: 3]
pops_subset.head()

## ▷▷▷▷▷ 🆕 Matplotlib 🆕 ◁◁◁◁◁

Now that we have a better understanding of our data, we can explore ways to represent it on a graph.

### [🌟] C01: Line Graphs

*Your First Plot*

In [None]:
# Plotting the change in 'facebook' user data across the years
plt.plot(media_df['date'], media_df['facebook'])

plt.show()

*Multi-line*

In [None]:
# Method 1
plt.plot(media_df['date'], media_df['facebook'], media_df['date'], media_df['twitter'], media_df['date'], media_df['youtube'])

plt.show()

In [None]:
# Method 2
plt.plot(media_df['date'], media_df['facebook'], label="facebook")
plt.plot(media_df['date'], media_df['twitter'], label="twitter")
plt.plot(media_df['date'], media_df['youtube'], label="youtube")

plt.show()

*Title, X-label, Y-label*

In [None]:
# Setting title, xlabel and y-label
plt.plot(media_df['date'], media_df['facebook'])

# Title, X-label, Y-label
plt.title("Facebook Popularity")
plt.xlabel("Year 2009-2023")
plt.ylabel("Market Share")

plt.show()

*Xlim, Ylim, Xticks, Yticks*

In [None]:
# Adjusting axis scale
# Setting x, y limit
plt.plot(media_df['date'], media_df['facebook'])
plt.ylim = [0, 120] # Why is nothing happening 🤔 ?

# Title, X-label, Y-label
plt.title("Facebook Popularity")
plt.xlabel("Year 2009-2023")
plt.ylabel("Market Share")

plt.show()
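
Nothing happens in the cell above because `plt.ylim = [0, 120]` assigns a list to the name `ylim` instead of calling the function, so the axis limits never change. After that assignment `plt.ylim` is a list, so `plt.ylim(0, 120)` only works again in a fresh session. A minimal sketch that sets the limit through the axes object instead, using made-up stand-in values rather than the real `media_df` column:

```python
import matplotlib.pyplot as plt

# Stand-in values (an assumption); the notebook would plot media_df['facebook'].
share = [60, 75, 88, 70]

fig, ax = plt.subplots()
ax.plot(range(len(share)), share)

# Set the y-axis limits with a method call rather than an assignment.
ax.set_ylim(0, 120)

y_limits = ax.get_ylim()
print(y_limits)
```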

In [None]:
# Setting the values on the x, y-axis
plt.plot(media_df['date'], media_df['facebook'])
plt.yticks(range(0, 121, 20))

plt.show()

*Annotations*

In [None]:
# Annotating Text
plt.plot(media_df['date'], media_df['facebook'])
plt.yticks(range(0, 121, 20))

# Annotation
plt.annotate('peak', xy=(80, 88), xytext=(100, 110), arrowprops=dict(facecolor='black', shrink=0.10))

# Title, X-label, Y-label
plt.title("Facebook Popularity")
plt.xlabel("Year 2009-2023")
plt.ylabel("Market Share")

plt.show()
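
The annotation coordinates above are hard-coded. One way to derive them from the data is sketched below, with a small made-up Series standing in for `media_df['facebook']`. Because the dates are plotted as strings, the x-coordinate of a point is assumed to be its row position in the column.

```python
import pandas as pd

# Made-up stand-in for media_df['facebook'] (an assumption, not real data).
facebook = pd.Series([60.0, 75.0, 88.0, 70.0])

peak_position = int(facebook.to_numpy().argmax())  # x: row position of the maximum
peak_value = float(facebook.max())                 # y: the maximum itself

print(peak_position, peak_value)

# These two values could replace the hard-coded numbers in xy=(80, 88).
```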

*Legend*

In [None]:
# Notice how the plot quickly gets very confusing
# Adding Legends

plt.plot(media_df['date'], media_df['facebook'], label="facebook")
plt.plot(media_df['date'], media_df['twitter'], label="twitter")
plt.plot(media_df['date'], media_df['youtube'], label="youtube")
plt.yticks(range(0, 121, 20))

# Legends
plt.legend(bbox_to_anchor=(1.30, 1))

plt.show()

### [🌟] [🌟] [❗] Challenge: Putting everything together (8min)

Your Graph should have the following:
1. Multi-line (Facebook, Twitter, YouTube)

2. A suitable y-axis scale

3. Legend

4. Annotation of highest point

5. Title

6. X-label

7. Y-label

In [None]:
# Putting it all together

plt.plot(media_df['date'], media_df['facebook'], label="facebook")
plt.plot(media_df['date'], media_df['twitter'], label="twitter")
plt.plot(media_df['date'], media_df['youtube'], label="youtube")
plt.yticks(range(0, 121, 20))

# Legends
plt.legend(bbox_to_anchor=(1.30, 1)) 

# Annotation
plt.annotate('peak', xy=(80, 88), xytext=(100, 110), arrowprops=dict(facecolor='black', shrink=0.10))

# Title, X-label, Y-label
plt.title("Social Media Popularity")
plt.xlabel("Year 2009-2023")
plt.ylabel("Market Share")

plt.show()

### [🌟] [🌟] C02: Pie Charts

In [None]:
# Extracting data of all social media in a specific month
x = media_df.columns
y = media_df.iloc[0].tolist()
print(x)
print(y)

In [None]:
# Representing the market share as a pie chart
fig = plt.figure()

# Set Background Colour
fig.set_facecolor('white')

# Likewise, pie charts can be labelled
plt.pie(y[1:], labels=x[1:])
plt.legend(bbox_to_anchor=(1.5, 1.1))
plt.title("Market Share of all social media")

plt.show()

### [🌟] [🌟] [🌟] [❗] Challenge: Can you make the pie chart look less cluttered?

-   Hint: Group the social media platforms with smaller market share

In [None]:
# Solution Part 1
# Calculating total percentage
total = 0
total = sum(y[1:])
# print(total) # Check whether market share adds up to 100%

sign_players_x = []
sign_players_y = []
others_x = []
others_y = 0

In [None]:
# Solution Part 2
# Sieving through market share of different social media platforms
for (i, value) in enumerate(y[1:]):
    if value/total*100 > 5:     
        sign_players_x.append(x[i + 1])
        sign_players_y.append(value)
    else:
        others_x.append(x[i + 1])
        others_y += value

print(sign_players_x, sign_players_y)
print(others_x, others_y)

In [None]:
# Solution Part 3
# Concatenate the lists
new_x = sign_players_x + ["others"]
new_y = sign_players_y + [others_y]

print(new_x)
print(new_y)

In [None]:
# Solution Results
fig = plt.figure()
fig.set_facecolor("white")

plt.pie(new_y, labels=new_x)
plt.legend(bbox_to_anchor=[1.2, 0.7])
plt.title("Market Share of Significant Social Media Platform")

plt.show()
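
The three solution parts can also be wrapped into one reusable function. This is a sketch only: the name `group_small_slices` and the sample numbers are made up for illustration, and the 5% threshold mirrors the one used in Solution Part 2.

```python
def group_small_slices(labels, values, threshold_percent=5):
    """Merge every slice at or below threshold_percent of the total into 'others'."""
    total = sum(values)
    kept_labels, kept_values, others_total = [], [], 0
    for label, value in zip(labels, values):
        if value / total * 100 > threshold_percent:
            kept_labels.append(label)
            kept_values.append(value)
        else:
            others_total += value
    return kept_labels + ["others"], kept_values + [others_total]


# Made-up shares for illustration only.
labels, values = group_small_slices(["a", "b", "c", "d"], [60, 34, 4, 2])
print(labels, values)
```

The two returned lists can be passed to `plt.pie` in the same way as `new_y` and `new_x`.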

### [🌟] C03: Plot Arrangement

In [None]:
# Line Graph
plt.figure(1)
plt.subplot(212)
plt.plot(media_df['date'], media_df['facebook'])

# Pie Chart
fig = plt.figure(2)
fig.set_facecolor("white")

plt.pie(new_y, labels=new_x)
plt.legend(bbox_to_anchor=[1.2, 0.7])
plt.title("Market Share of Significant Social Media Platform")

plt.show()
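
In `plt.subplot(212)` the three digits mean 2 rows, 1 column, and position 2, so the line graph is drawn in the bottom half of figure 1. A minimal sketch that fills both positions of such a grid in a single figure, with made-up numbers standing in for the notebook's data:

```python
import matplotlib.pyplot as plt

# Made-up values for illustration only.
share = [60, 75, 88, 70]

fig = plt.figure()

plt.subplot(211)                      # 2 rows, 1 column, top position
plt.plot(range(len(share)), share)

plt.subplot(212)                      # 2 rows, 1 column, bottom position
plt.pie([60, 34, 6], labels=["a", "b", "others"])

axes_count = len(fig.axes)
print(axes_count)
```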

## ▷▷▷▷▷ 🆕 Seaborn 🆕 ◁◁◁◁◁

Seaborn is a library built on top of Matplotlib that lets us work with data more easily. It works directly with pandas DataFrames out of the box, which makes data analysis even easier.

In [None]:
import seaborn as sns

# Fun Fact: The Seaborn library was apparently named after a character named Samuel Norman Seaborn from the television show "The West Wing" 
# Thus, the standard alias is the character's initials ("sns").

## ▷▷▷▷▷ 📈 More Data 📈 ◁◁◁◁◁

We will be experimenting with 3 different datasets to explore the powers of Seaborn.

### Dataset 1: Social Media

This is the same dataset used in the Matplotlib section!  

We modify the data to set the date as the index and to only keep certain columns to avoid clutter.

In [None]:
# convert the names of all the columns to lowercase
media_df.columns = [col.lower() for col in media_df]
# set the date column as the index
media_df = media_df.set_index("date")
print(media_df.columns)

# only keep the columns shown below to avoid clutter in the graph
preserve_columns = ["facebook", "twitter", "stumbleupon", "myspace", "digg", "reddit", "other"]
# for all other social media platforms, add their share to "other"
# (writing to the rows yielded by iterrows() is not guaranteed to change media_df,
# so we sum whole columns instead)
minor_columns = [col for col in media_df.columns if col not in preserve_columns]
media_df["other"] = media_df["other"] + media_df[minor_columns].sum(axis=1)
# update and drop the columns no longer needed
media_df = media_df[preserve_columns]

In [None]:
media_df.head()

### Dataset 2: Social Media (Modified)

This is the same data as above! Run this code and look for the difference in how the data is structured.  

Dataset 2 is a bit more "messy": instead of one column per platform, the values for every date and platform are stacked into a single `value` column, and a `platform` column says which platform each row belongs to.

In [None]:
media_messy_df = pd.read_csv('./social_media_messy.csv')
media_messy_df = media_messy_df.set_index("date") 
print(media_messy_df.columns)

In [None]:
media_messy_df.head()
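
The two layouts are often called wide (Dataset 1) and long (Dataset 2). How `social_media_messy.csv` was actually produced is not shown here. As an assumption, a wide table can be reshaped into the same `date` / `platform` / `value` layout with `DataFrame.melt`, sketched on made-up numbers:

```python
import pandas as pd

# Made-up wide table: one column per platform.
wide_df = pd.DataFrame({
    "date": ["2020-01", "2020-02"],
    "facebook": [70.0, 68.0],
    "twitter": [10.0, 12.0],
})

# Stack the platform columns into one 'platform' column and one 'value' column.
long_df = wide_df.melt(id_vars="date", var_name="platform", value_name="value")
print(long_df)
```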

### Dataset 3: Students Performance Data (Modified)

This dataset records students' test scores along with related information about each student.

In [None]:
students_df = pd.read_csv('./students_performance.csv')  

In [None]:
students_df.head()

## ▷▷▷▷▷ 🌊 Using Seaborn 🌊 ◁◁◁◁◁

In [None]:
sns.set_theme(style="whitegrid", palette="tab10") # sets a colour scheme

### [🌟] C01: Line Graphs

We investigate how the usage of social media platforms changes over time.

In [None]:
# From Dataset 1
graph=sns.lineplot(data=media_df, linewidth=2.5, dashes=False) # plots all the data in a line plot based on platforms

In [None]:
# From Dataset 1
graph=sns.lineplot(data=media_df, linewidth=2.5, dashes=False)

# Notice how all the dates are combined and too messy to read?
# We do this to only label dates that are in the month of January (e.g. 2020-01)
# We also label it in the form '20, '21
xticks, xticklabels = [], []
for idx, time in enumerate(media_df.index):
    if time[-3:] == "-01":
        xticks.append(idx) 
        xticklabels.append("'"+time[2:-3])

graph.set_xticks(xticks)
graph.set_xticklabels(xticklabels)

# set x limit to avoid excessive white space
graph.set_xlim(0, len(media_df.index)-1)
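
The tick-selection loop above can be checked on its own, without any plotting. The sketch below uses a hypothetical helper name, `january_ticks`, and made-up dates in the `YYYY-MM` form that the loop assumes:

```python
def january_ticks(dates):
    """Return tick positions and short year labels for the January entries."""
    positions, labels = [], []
    for idx, date in enumerate(dates):
        if date.endswith("-01"):
            positions.append(idx)
            labels.append("'" + date[2:4])
    return positions, labels


# Made-up dates in the 'YYYY-MM' form the loop above assumes.
positions, labels = january_ticks(["2019-12", "2020-01", "2020-02", "2021-01"])
print(positions, labels)
```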

In [None]:
# From Dataset 2
# Note the extra parameters to give seaborn more info about what we're plotting.
graph=sns.lineplot(data=media_messy_df, x="date", y="value", hue="platform", linewidth=2.5)

graph.set_xticks(xticks)
graph.set_xticklabels(xticklabels)
graph.set_xlim(0, len(media_df.index)-2)

### [🌟] C02: Distribution Plot

We investigate how certain attributes may affect students' math, reading and writing scores.

In [None]:
# Distribution Plot (Count)
sns.displot(
    data=students_df, x="math score", col="ethnicity", row="gender",
    binwidth=5, facet_kws=dict(margin_titles=True)
)

### [🌟] [🌟] Challenge: Can you use a percentage instead of a count to make the graphs easier to compare?

-   The different totals in each graph make it hard to draw a comparison.

In [None]:
# Challenge 5: Solution
sns.displot(
    data=students_df, x="math score", col="ethnicity", row="gender",
    binwidth=5, facet_kws=dict(margin_titles=True), stat='percent', common_norm=False
)
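
With `stat='percent'` alone, seaborn normalizes over the full dataset, so the bars of a small group stay small. Setting `common_norm=False` normalizes each histogram independently. The arithmetic behind the two options, sketched with made-up bin counts for two groups of very different size:

```python
# Made-up bin counts for two groups of very different size.
counts = {"group A": [2, 6, 2], "group B": [10, 70, 20]}

grand_total = sum(sum(bins) for bins in counts.values())

# Shared normalization: every bar is a share of all observations.
common = {g: [100 * c / grand_total for c in bins] for g, bins in counts.items()}

# Independent normalization: every bar is a share of its own group.
separate = {g: [100 * c / sum(bins) for c in bins] for g, bins in counts.items()}

print(common["group A"])
print(separate["group A"])
```

Under independent normalization both groups sum to 100%, so their shapes can be compared directly.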

### [🌟] C03: Relationship Plot

We investigate how one score correlates with another score and with other attributes.

In [None]:
# Relationship Plot (Simple)
sns.relplot(data=students_df, x="reading score", y="math score", alpha=0.5)

In [None]:
# Relationship Plot (Colour, Categorical)
sns.relplot(data=students_df, x="reading score", y="math score", hue="gender", alpha=0.5)

In [None]:
# Relationship Plot (Colour, Numerical)
sns.relplot(data=students_df, x="reading score", y="math score", hue="writing score", alpha=0.5)

In [None]:
# Relationship Plot (Size)
sns.relplot(data=students_df, x="reading score", y="math score", size="writing score", sizes=(2, 100), alpha=0.5)

### [🌟] C04: Heatmap Plot

The correlations between scores may be a bit hard to observe as seen above. We can instead use a heatmap to visualise this more easily.

In [None]:
# we investigate how WRITING score is predicted by reading/math score
students_pivoted = students_df.pivot_table(index="math score", columns="reading score", values="writing score", aggfunc="mean", fill_value=None) 
students_pivoted.head()
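
`pivot_table` averages the writing scores for each (math score, reading score) pair and leaves `NaN` wherever a pair never occurs in the data. A minimal sketch on three made-up students:

```python
import pandas as pd

# Three made-up students; two share the same math and reading scores.
toy_df = pd.DataFrame({
    "math score": [50, 50, 60],
    "reading score": [55, 55, 65],
    "writing score": [52, 58, 70],
})

toy_pivoted = toy_df.pivot_table(
    index="math score", columns="reading score",
    values="writing score", aggfunc="mean",
)
print(toy_pivoted)

# Count the cells with no matching student.
missing_cells = int(toy_pivoted.isna().sum().sum())
print(missing_cells)
```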

In [None]:
# Missing (NaN) cells are drawn in white; all other values map to a colour on the spectrum.
cmap = sns.color_palette("viridis_r", as_cmap=True)
cmap.set_bad("white")

sns.heatmap(data=students_pivoted, annot=False, linewidths=.5, cmap=cmap, cbar_kws={'label': 'writing score'})