## Step 2: Exploratory Data Analysis

#### Name: Tian Lan

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

In [4]:
clean_df = pd.read_csv('clean_data.csv')

clean_df.sample(5)

Unnamed: 0,User_ID,Hashtags,Song Title,Video Length,Likes,Shares,Comments,Views,Followers,Total Likes,Total Videos,Upload Year,Upload Month,Upload Day,Upload Weekday,Upload Period,Total Engagement,Engagement Rate
25379,6890647806692871426,"['parati', 'fyp']",Original Sound,9,8266,34,155,78600,9900000,118800000,703,2020,11,2,0,Evening,8455,0.000854
25395,6890621803505519873,['dueto'],The cigarette duet by princess Chelsea,9,95300,534,2028,568800,5500000,150800000,1193,2020,11,2,0,Evening,97862,0.017793
15316,6884361854714383621,[],Knock Knock,11,44000,94,204,297100,5300000,108000000,455,2020,10,16,4,Evening,44298,0.008358
15177,6868372637559590150,"['popsmoke', 'AimForTheMoon']",Aim For The Moon,13,298900,5313,4291,3000000,2400000,13500000,45,2020,9,3,3,Evening,308504,0.128543
34126,6887622271150656774,[],original sound,40,24800,9,63,109300,2100000,31800000,602,2020,10,25,6,Evening,24872,0.011844


In [5]:
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52844 entries, 0 to 52843
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   User_ID           52844 non-null  int64  
 1   Hashtags          52844 non-null  object 
 2   Song Title        52844 non-null  object 
 3   Video Length      52844 non-null  int64  
 4   Likes             52844 non-null  int64  
 5   Shares            52844 non-null  int64  
 6   Comments          52844 non-null  int64  
 7   Views             52844 non-null  int64  
 8   Followers         52844 non-null  int64  
 9   Total Likes       52844 non-null  int64  
 10  Total Videos      52844 non-null  int64  
 11  Upload Year       52844 non-null  int64  
 12  Upload Month      52844 non-null  int64  
 13  Upload Day        52844 non-null  int64  
 14  Upload Weekday    52844 non-null  int64  
 15  Upload Period     52844 non-null  object 
 16  Total Engagement  52844 non-null  int64 

#### Distribution of the Data

- Distribution of Engagement Features (Views, Likes, Comments, Shares)

In [None]:
engagement_columns = ['Likes', 'Shares', 'Comments', 'Views']

plt.figure(figsize=(10, 6))

for i, col in enumerate(engagement_columns):
    plt.subplot(2, 2, i+1)
    sns.histplot(clean_df[col], bins=80, kde=True)
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.title(f'Distribution of {col}')

plt.tight_layout()
plt.show()

From the histograms above, we can see that the engagement-related features are heavily skewed and concentrated in the lower range, so the data is clearly not normally distributed.
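One common way to make such skewed counts easier to inspect is a log transform. The snippet below is a minimal sketch and not part of the original analysis: it builds a small synthetic stand-in (`demo_df`, a hypothetical name) for the engagement columns and compares the skewness before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for clean_df: log-normal draws mimic a heavy right skew.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "Likes": rng.lognormal(mean=9, sigma=1.5, size=5000).astype("int64"),
    "Views": rng.lognormal(mean=12, sigma=1.5, size=5000).astype("int64"),
})

# Skewness of the raw counts (large positive values indicate a long right tail).
raw_skew = demo_df.skew()

# log1p(x) = log(1 + x), which handles zero counts safely.
log_skew = np.log1p(demo_df).skew()

print(pd.DataFrame({"raw": raw_skew, "log1p": log_skew}))
```

On the real data, the same transform could be applied to the columns in `engagement_columns` before plotting the histograms, assuming all counts are non-negative.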

- Distribution of Upload Time and the Period - provide insights into temporal patterns and trends

- Distribution of Followers and Total Likes - gain a better understanding of the range and diversity of user profiles, including both influencers and regular users, within the dataset

#### Grouping data with duplicated meaning

Examining the dataset shows that the "Song Title" column contains values indicating that a video uses its original sound rather than a background track. These values appear in multiple languages but all mean the same thing. For consistency, I will replace every language variant with the term "original sound". Treating these entries as a single group makes it possible to analyze the characteristics and patterns of videos that use the original sound.

In [None]:
clean_df['Song Title'].value_counts()

In [None]:
song_title_ori = clean_df[clean_df['Song Title'].str.contains('ori', case=False)]['Song Title'].value_counts()

song_title_ori

In [None]:
# Only display the song titles containing 'ori' that occur more than 10 times

song_title_ori[song_title_ori > 10].reset_index()

In [None]:
# "Original sound" in multiple languages.
# Song titles of the form "Original Sound - Username" indicate that
# the sound was originally created by the user named in the title.

orig_sounds = ['original sound', 'sonido original', 'som original', 
               'Originalton', 'Original Sound', 'orijinal ses', 
               'son original', 'оригинальный звук', 'suono originale', 
               'origineel geluid']

In [None]:
clean_df['Song Title'] = clean_df['Song Title'].replace(orig_sounds, 'original sound')

In [None]:
clean_df['Song Title'].value_counts()

Based on the observation, it is evident that the utilization of the original sound in video recordings constitutes more than half of the available video data. This finding suggests that the non-background music (non-BGM) type of TikTok videos holds significance within the dataset. It is important to note that this category may encompass various scenarios, including videos where users choose to use another device to play background music or videos where the usage of background audio is not explicitly indicated or marked.
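The share of each title can be checked directly with normalized counts. The following is a small sketch on made-up titles (`toy_titles` is a hypothetical series, not drawn from the dataset). On the real data, the same call on `clean_df['Song Title']` would give the proportion behind the "more than half" observation.

```python
import pandas as pd

# Hypothetical titles, already standardized as in the cells above.
toy_titles = pd.Series(
    ["original sound", "original sound", "original sound",
     "Knock Knock", "Aim For The Moon"],
    name="Song Title",
)

# normalize=True returns proportions instead of raw counts.
shares = toy_titles.value_counts(normalize=True)
original_share = shares["original sound"]

print(f"original sound share: {original_share:.0%}")
```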

#### Exploring Song Titles

- Engagement vs. Song Title

In [None]:
clean_df.info()

In [None]:
engagement_features = ['Engagement Rate', 'Total Engagement', 'Likes', 'Shares', 'Comments', 'Views']
colors = ['blue', 'green', 'orange', 'purple', 'yellow', 'black']

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(10,8))
fig.subplots_adjust(hspace=0.4)

for i, feature in enumerate(engagement_features):
    
    top_songs = clean_df.nlargest(5, feature)[['Song Title', feature]]
    top_songs = top_songs.sort_values(feature, ascending=True)
    
    row = i // 2
    col = i % 2
    
    axes[row, col].barh(top_songs['Song Title'], top_songs[feature], color=colors[i])
    axes[row, col].set_xlabel(feature)
    #axes[row, col].set_ylabel('Song Title')
    axes[row, col].set_title(f'Top 5 Song Titles by {feature}')

plt.tight_layout()
plt.show()

Based on the observations made, it is evident that the term 'original sound' frequently appears in the bar plot, predominantly due to its association with the group of videos that do not include background music. To further analyze the relationship between songs and engagement, it is essential to remove the 'original sound' value from the dataset and focus on the remaining song titles.

In [None]:
song_df = clean_df[clean_df['Song Title'] != 'original sound']

song_df.info()

In [None]:
# Redo the visualization without 'original sound'

engagement_features = ['Engagement Rate', 'Total Engagement', 'Likes', 'Shares', 'Comments', 'Views']
colors = ['blue', 'green', 'orange', 'purple', 'yellow', 'black']

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(10,8))
fig.subplots_adjust(hspace=0.4)

for i, feature in enumerate(engagement_features):
    
    top_songs = song_df.nlargest(5, feature)[['Song Title', feature]]
    top_songs = top_songs.sort_values(feature, ascending=True)
    
    row = i // 2
    col = i % 2
    
    axes[row, col].barh(top_songs['Song Title'], top_songs[feature], color=colors[i])
    axes[row, col].set_xlabel(feature)
    #axes[row, col].set_ylabel('Song Title')
    axes[row, col].set_title(f'Top 5 Song Titles by {feature}')

plt.tight_layout()
plt.show()

The bar charts above show that certain songs appear multiple times in the dataset. Notably, some songs rank highly by "Total Engagement", which in this dataset sums likes, shares, and comments, indicating their popularity in absolute terms. However, these same songs may not feature prominently in the "Engagement Rate" ranking. Because the engagement rate is measured relative to the number of followers, a creator's follower count has a significant influence on it.
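The sampled rows shown earlier are consistent with `Total Engagement` being the sum of likes, shares, and comments, and `Engagement Rate` being that total divided by `Followers`. This is inferred from the sample output rather than stated in the notebook, so the sketch below is only a check under that assumption. `sample_rows` is a hypothetical name and simply re-enters three of the displayed rows.

```python
import numpy as np
import pandas as pd

# Three rows copied from the clean_df.sample(5) output above.
sample_rows = pd.DataFrame({
    "Likes": [8266, 95300, 298900],
    "Shares": [34, 534, 5313],
    "Comments": [155, 2028, 4291],
    "Followers": [9_900_000, 5_500_000, 2_400_000],
    "Total Engagement": [8455, 97862, 308504],
    "Engagement Rate": [0.000854, 0.017793, 0.128543],
})

# Assumed definitions (inferred from the sample, not confirmed by the source).
total = sample_rows[["Likes", "Shares", "Comments"]].sum(axis=1)
rate = total / sample_rows["Followers"]

# Compare the recomputed values with the stored columns.
totals_match = bool((total == sample_rows["Total Engagement"]).all())
rates_match = bool(np.allclose(rate, sample_rows["Engagement Rate"], atol=1e-6))

print(totals_match, rates_match)
```

Under this definition, two videos with the same total engagement can have very different engagement rates if their creators' follower counts differ.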

Furthermore, it is important to note that not all the songs identified in the analysis are specific to the timeframe of data collection, which spans 2020 and 2021. Some songs may be classified as classic music or are popular during certain holiday seasons. This indicates that the TikTok community's preferences extend beyond contemporary hit songs and encompass a broader range of musical genres and time periods.

The relationship between a song's popularity on TikTok and its status as a hit song outside the platform is complex and challenging to ascertain definitively. It is possible that songs gain popularity on TikTok and subsequently become hit songs, or vice versa, where already popular songs find their way onto the TikTok platform and further enhance their reach and popularity. The interplay between TikTok trends and mainstream music trends contributes to the dynamic nature of song popularity on the platform.

#### Engagement Column Correlation

In [None]:
clean_df.info()

In [None]:
numeric_df = clean_df.copy()

# Keep only the numeric columns so that correlations can be computed
numeric_df = numeric_df.select_dtypes(include='number')

In [None]:
X = clean_df.drop('Engagement Rate', axis=1)
y = clean_df['Engagement Rate'].copy()

In [None]:
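# Restrict to numeric columns; X still contains object columns such as 'Song Title'
X.corr(numeric_only=True)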

In [None]:
# EDA: check the distribution of the upload period

clean_df['Upload Period'].value_counts()

category_order = ['Morning', 'Afternoon', 'Evening', 'Midnight']

sns.countplot(data=clean_df, x='Upload Period', order=category_order)

plt.xlabel('Upload Period')
plt.ylabel('Count')
plt.title('Distribution of Upload Period')

plt.show()