<a href="https://colab.research.google.com/github/turgeng/TikTok-Engagement-Cleaning-Analysis/blob/main/TikTok_Content_and_Engagement_Data_Quality_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **TikTok Content & Engagement Data Quality Analysis**


The goal of this project is to conduct an initial exploratory data analysis (EDA) of TikTok’s video dataset in preparation for building a claims classification model.

Through this process, I will:
- Acquaint myself with the dataset and its structure  
- Compile descriptive statistics to identify potential data quality issues  
- Examine variable distributions and potential outliers  

The ultimate objective is to summarize the dataset, evaluate its readiness for further analysis, and communicate preliminary insights that can guide the next stages of data modeling.


## Imports and Data Loading

To begin the analysis, I import the necessary Python libraries for data handling and exploration.  
Then, I load the provided TikTok dataset into a pandas DataFrame to inspect its structure, data types, and missing values.




In [None]:
import pandas as pd
import numpy as np

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
data = pd.read_csv('/content/drive/MyDrive/TikTok_Project/tiktok_dataset.csv')


## Understand the data - Inspect the data

In [None]:
# Display and examine the first ten rows of the dataframe
data.head(10)

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0
5,6,claim,8972200955,35,someone shared with me that gross domestic pro...,not verified,under review,336647.0,175546.0,62303.0,4293.0,1857.0
6,7,claim,4958886992,16,someone shared with me that elvis presley has ...,not verified,active,750345.0,486192.0,193911.0,8616.0,5446.0
7,8,claim,2270982263,41,someone shared with me that the best selling s...,not verified,active,547532.0,1072.0,50.0,22.0,11.0
8,9,claim,5235769692,50,someone shared with me that about half of the ...,not verified,active,24819.0,10160.0,1050.0,53.0,27.0
9,10,claim,4660861094,45,someone shared with me that it would take a 50...,verified,active,931587.0,171051.0,67739.0,4104.0,2540.0


In [None]:
# Get summary info
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB


In [None]:
# Get summary statistics
data.describe()

Unnamed: 0,#,video_id,video_duration_sec,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
count,19382.0,19382.0,19382.0,19084.0,19084.0,19084.0,19084.0,19084.0
mean,9691.5,5627454000.0,32.421732,254708.558688,84304.63603,16735.248323,1049.429627,349.312146
std,5595.245794,2536440000.0,16.229967,322893.280814,133420.546814,32036.17435,2004.299894,799.638865
min,1.0,1234959000.0,5.0,20.0,0.0,0.0,0.0,0.0
25%,4846.25,3430417000.0,18.0,4942.5,810.75,115.0,7.0,1.0
50%,9691.5,5618664000.0,32.0,9954.5,3403.5,717.0,46.0,9.0
75%,14536.75,7843960000.0,47.0,504327.0,125020.0,18222.0,1156.25,292.0
max,19382.0,9999873000.0,60.0,999817.0,657830.0,256130.0,14994.0,9599.0


**Observation:**

Each row in the dataset represents a single TikTok video record that has been labeled as a *claim* or *opinion*.  
Each record contains various attributes such as:

- **video_id**: unique identifier for each video  
- **video_duration_sec**: duration of the video in seconds  
- **video_transcription_text**: text transcription of the spoken content  
- **verified_status**: whether the account is verified  
- **author_ban_status**: moderation status of the author’s account (`active`, `under review`, or `banned`)  
- **video_view_count**, **video_like_count**, **video_share_count**, **video_download_count**, **video_comment_count**: engagement metrics  

In summary, each row represents one TikTok post and its associated metadata, performance, and classification attributes.

- The dataset contains multiple data types: **integers**, **floats**, and **objects** (string-based text data).  
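- Seven columns contain **298 null values each** (19,084 non-null entries out of 19,382 rows): `claim_status`, `video_transcription_text`, and the five engagement counts. These rows need to be dropped or imputed before modeling.  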
- Numerical variables (such as engagement counts and duration) coexist with categorical variables (like claim status or verification).  
- The mixture of numeric and text columns indicates that different preprocessing strategies will be required later (e.g., encoding categorical variables, handling long text fields).  

Overall, the structure appears well-defined and ready for exploratory analysis.
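
The `Non-Null Count` column of `data.info()` reports 19,084 non-null entries out of 19,382 rows for several columns, so missingness is worth counting explicitly. A minimal sketch on a small invented frame (the `toy` DataFrame and its values are illustrative assumptions, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the TikTok data; the values are invented for illustration.
toy = pd.DataFrame({
    "claim_status": ["claim", "opinion", None, "claim"],
    "video_duration_sec": [59, 32, 31, 25],
    "video_view_count": [343296.0, 140877.0, np.nan, 437506.0],
})

# Count missing values per column.
missing_per_column = toy.isna().sum()

# Keep only rows that have both a label and a view count.
toy_clean = toy.dropna(subset=["claim_status", "video_view_count"])

print(missing_per_column)
print(len(toy), "rows before,", len(toy_clean), "rows after dropping")
```

On the real data the equivalent calls would be `data.isna().sum()` and `data.dropna(...)`. Whether to drop or impute the incomplete rows is a modeling decision that this sketch does not settle.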


- Engagement counts are strongly right-skewed: mean views (254,709) far exceed the median (9,954.5), and likes, shares, downloads, and comments show the same pattern.  
- There are **no obvious anomalies or impossible values** (e.g., negative counts or zero durations); video duration ranges from 5 to 60 seconds.  
- The range between minimum and maximum values suggests some potential **outliers** in engagement metrics (like view or like counts), which is expected in social media data due to viral content.  

Overall, the value ranges appear plausible for short-form video, though the skew means that medians describe a typical video better than means.
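
One common way to turn "potential outliers" into a concrete check is the 1.5 × IQR rule. The sketch below applies it to invented view counts; the `views` Series is an assumption for illustration, and the rule itself is only one of several reasonable criteria:

```python
import pandas as pd

# Invented view counts: mostly small, plus two viral videos.
views = pd.Series([20, 4900, 5200, 9900, 10100, 12000, 504000, 999000])

# Quartiles and interquartile range.
q1 = views.quantile(0.25)
q3 = views.quantile(0.75)
iqr = q3 - q1

# Values above the upper fence are flagged as potential outliers.
upper_fence = q3 + 1.5 * iqr
flagged = views[views > upper_fence]

print("upper fence:", upper_fence)
print("flagged values:", flagged.tolist())
```

Applied to the `video_view_count` quartiles reported by `describe()` (25%: 4,942.5; 75%: 504,327), the upper fence is about 1.25 million, which is above the maximum of 999,817. The rule would therefore flag no view counts in this dataset, so a different criterion may be needed for that column.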


## Understand the Data: Investigate the Variables

In this phase, the goal is to explore the dataset variables more closely to understand their characteristics and relationships.

The project’s ultimate objective is to prepare for a **machine learning model** that can classify TikTok videos as either *claims* or *opinions*.  
To begin, it is logical to start with the **`claim_status`** variable, which defines these two categories.

**Step 1:** Determine how many videos exist in each claim status category.  
This helps assess the class balance between “claim” and “opinion” videos.

**Step 2:** Investigate engagement trends — such as view, like, share, and comment counts — for each claim status.  
This will show whether claim-related content drives different audience interactions compared to opinion-based videos.

To do this, Boolean masking is used to filter the data by `claim_status`, and summary statistics (mean and median) are calculated for engagement metrics.


## 1- Investigate the `claim_status` Variable




In [None]:
data['claim_status'].value_counts()

| claim_status | count |
|---------------|-------|
| claim | 9608 |
| opinion | 9476 |


## 2- Investigate Engagement Trends: Boolean Masking and Summary Statistics

In [None]:
# Separate the data based on claim status
claims = data[data['claim_status'] == 'claim']
opinions = data[data['claim_status'] == 'opinion']

# Calculate engagement statistics for each category
claim_view_mean = claims['video_view_count'].mean()
claim_view_median = claims['video_view_count'].median()

opinion_view_mean = opinions['video_view_count'].mean()
opinion_view_median = opinions['video_view_count'].median()

print('Mean view count (claims):', claim_view_mean)
print('Median view count (claims):', claim_view_median)
print('Mean view count (opinions):', opinion_view_mean)
print('Median view count (opinions):', opinion_view_median)


Mean view count (claims): 501029.4527477102
Median view count (claims): 501555.0
Mean view count (opinions): 4956.43224989447
Median view count (opinions): 4953.0
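
The same comparison can be extended to several engagement columns at once by grouping on `claim_status` instead of building one Boolean mask per category. A sketch on an invented frame (the `toy` rows are assumptions; on the real data the call would use `data` and the full list of engagement columns):

```python
import pandas as pd

# Invented rows; only the column names match the dataset.
toy = pd.DataFrame({
    "claim_status": ["claim", "claim", "claim", "opinion", "opinion", "opinion"],
    "video_view_count": [500000.0, 900000.0, 100000.0, 4000.0, 5000.0, 9000.0],
    "video_like_count": [200000.0, 90000.0, 40000.0, 900.0, 1000.0, 2600.0],
})

metrics = ["video_view_count", "video_like_count"]

# Mean and median of every metric for each claim status.
by_status = toy.groupby("claim_status")[metrics].agg(["mean", "median"])

print(by_status)
```

Rows with a missing `claim_status` are left out of the groups by default, which matches the behavior of the Boolean masks.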


## Interpretation: Claim Status and Engagement Insights

**Claim Status Distribution**

The dataset includes two categories under the `claim_status` variable: **claim** and **opinion**.  
The counts show that the data is quite balanced:

| Claim Status | Count |
|---------------|-------|
| claim | 9,608 |
| opinion | 9,476 |

This balance indicates that both categories are nearly equally represented, which is beneficial for future model training and evaluation.
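
To express the balance as percentages, the counts can be divided by their total. The sketch below uses the two counts copied from the `value_counts()` output; the variable names are illustrative:

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"claim": 9608, "opinion": 9476}, name="count")

# Share of each class among labeled videos, in percent.
shares = (counts / counts.sum() * 100).round(1)

print(shares)
```

On the real data, `data['claim_status'].value_counts(normalize=True)` returns the same proportions directly, excluding rows with a missing label.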

---

**Engagement Analysis**

| Metric | Claim Videos | Opinion Videos |
|---------|---------------|----------------|
| Mean view count | 501,029 | 4,956 |
| Median view count | 501,555 | 4,953 |

These results reveal a **substantial difference in view counts** between the two content types.  
Videos labeled as *claims* have mean and median view counts roughly 100 times those of *opinions*. Because the gap appears in the medians as well as the means, it is not driven by a handful of viral videos.  
This pattern suggests that claim-related content tends to spread more widely on the platform, possibly because claims spark greater curiosity, controversy, or discussion.

---

**Summary**

- The dataset is well-balanced between claims and opinions.  
- Claim videos receive far more views than opinion videos (median 501,555 vs. 4,953, roughly a 100-fold difference).  
- This insight highlights a potential behavioral trend that the upcoming machine learning model could leverage for classification.


## Analyze Trends by Author Ban Status

Next, the analysis examines whether an author's **ban status** shows any relationship with the claim classification of their videos.  
This can help identify potential behavioral or policy-related patterns — for example, whether accounts posting claims are more likely to have been banned.

To explore this relationship, we can use the **`groupby()`** function to count how many videos fall into each combination of `claim_status` and `author_ban_status`.  
This provides a cross-tabulated view of how content type and account status interact.


In [None]:
# Count the number of videos for each combination of claim_status and author_ban_status
data.groupby(['claim_status', 'author_ban_status']).size().reset_index(name='video_count')


Unnamed: 0,claim_status,author_ban_status,video_count
0,claim,active,6566
1,claim,banned,1439
2,claim,under review,1603
3,opinion,active,8817
4,opinion,banned,196
5,opinion,under review,463


## Interpretation: Claim Status × Author Ban Status

- The majority of both *claim* and *opinion* videos come from **active authors**.  
- However, the share of videos from **banned authors** is notably higher among *claim* videos (1,439 of 9,608, about 15.0%) than among *opinion* videos (196 of 9,476, about 2.1%).  
- The *under review* category is also more common among *claim* videos (1,603, about 16.7%, vs. 463, about 4.9%), suggesting that claim-related content may trigger moderation or verification more often.  

**Summary Insight**

These findings suggest a possible correlation between content type and moderation activity.  
TikTok accounts posting **claim-type content** may be subject to **stricter scrutiny or higher risk of ban/review**, potentially due to misinformation or policy-violation detection systems.  
This variable could therefore be an important **predictor** in a future classification model.
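
Because the two claim categories differ slightly in size, row percentages make the comparison easier to read than raw counts. The sketch below rebuilds the count table from the `groupby()` output and normalizes each row; the variable names are illustrative:

```python
import pandas as pd

# Counts copied from the groupby output above.
counts = pd.DataFrame(
    {
        "active": [6566, 8817],
        "banned": [1439, 196],
        "under review": [1603, 463],
    },
    index=pd.Index(["claim", "opinion"], name="claim_status"),
)

# Divide each row by its total to get the share of each ban status
# within a claim status, in percent.
row_pct = counts.div(counts.sum(axis=1), axis=0).mul(100).round(1)

print(row_pct)
```

On the raw data, `pd.crosstab(data['claim_status'], data['author_ban_status'], normalize='index')` should yield the same proportions without the intermediate count table.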


## Investigate Engagement by Author Ban Status

Continuing the exploration of engagement metrics, this step focuses on the relationship between the **author’s ban status** and their videos’ performance.

By calculating the **median video share count** for each category of `author_ban_status`, we can identify whether banned or under-review authors tend to have higher or lower sharing activity.

Using the median helps minimize the effect of extreme values (viral videos) and gives a clearer view of the central engagement trend within each group.
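
A small illustration of why the median is preferred here: adding one viral video to an otherwise ordinary group moves the mean far more than the median. The share counts below are invented for the example:

```python
import pandas as pd

# Invented share counts for five ordinary videos.
shares = pd.Series([300, 400, 450, 500, 600])

# The same group with one viral video added.
with_viral = pd.concat([shares, pd.Series([250000])], ignore_index=True)

print("mean:  ", shares.mean(), "->", round(with_viral.mean(), 1))
print("median:", shares.median(), "->", with_viral.median())
```

The mean jumps by two orders of magnitude while the median barely changes, which is the behavior the comparison across ban statuses relies on.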


In [None]:
# Calculate the median video share count for each author ban status
median_share_by_ban = data.groupby('author_ban_status')['video_share_count'].median().reset_index()

median_share_by_ban


Unnamed: 0,author_ban_status,video_share_count
0,active,437.0
1,banned,14468.0
2,under review,9444.0


## Interpretation: Median Shares by Author Ban Status

The difference between groups is striking:
- Videos from **banned authors** have the highest median share count (14,468), followed by those **under review** (9,444).  
- By contrast, videos from **active authors** have a median of only 437 shares.

This pattern suggests that **content from banned or under-review accounts tends to spread more widely**. The dataset does not record when a ban or review took place, so it cannot show whether the high share counts came before the moderation action.  
Such virality might be linked to controversial, sensational, or misleading content — which often draws attention but also increases the likelihood of policy violations.

**Insight**

Author ban status appears to correlate strongly with share activity.  
This feature could play an important role in future modeling tasks, either as:
- a predictive indicator of content risk, or  
- a control variable for moderation-related behavior.


## Aggregate Engagement Metrics by Author Ban Status

To further understand engagement behavior, the dataset can be grouped by each author's **ban status**.  
For each group, we calculate three summary statistics — **count**, **mean**, and **median** — for the following engagement metrics:

- `video_view_count`
- `video_like_count`
- `video_share_count`

The `agg()` function allows multiple calculations across multiple columns by passing a dictionary, where the **keys** are column names and the **values** are lists of the desired aggregation functions.


In [None]:
# Group by author ban status and calculate count, mean, and median
ban_status_summary = data.groupby('author_ban_status').agg({
    'video_view_count': ['count', 'mean', 'median'],
    'video_like_count': ['count', 'mean', 'median'],
    'video_share_count': ['count', 'mean', 'median']
}).reset_index()

# Display the result
ban_status_summary


| | author_ban_status | video_view_count: count | video_view_count: mean | video_view_count: median | video_like_count: count | video_like_count: mean | video_like_count: median | video_share_count: count | video_share_count: mean | video_share_count: median |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | active | 15383 | 215927.039524 | 8616.0 | 15383 | 71036.533836 | 2222.0 | 15383 | 14111.466164 | 437.0 |
| 1 | banned | 1635 | 445845.439144 | 448201.0 | 1635 | 153017.236697 | 105573.0 | 1635 | 29998.942508 | 14468.0 |
| 2 | under review | 2066 | 392204.836399 | 365245.5 | 2066 | 128718.050339 | 71204.5 | 2066 | 25774.696999 | 9444.0 |
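
Passing a dictionary of lists to `agg()` produces two-level (MultiIndex) column labels, which are awkward to select from later. One alternative is pandas named aggregation, which yields flat column names. A sketch on invented rows (the `toy` frame and the output column names such as `view_median` are assumptions for illustration):

```python
import pandas as pd

# Invented rows; only the column names match the dataset.
toy = pd.DataFrame({
    "author_ban_status": ["active", "active", "banned", "banned", "under review"],
    "video_view_count": [8000.0, 9000.0, 400000.0, 500000.0, 365000.0],
    "video_share_count": [400.0, 500.0, 14000.0, 15000.0, 9400.0],
})

# Named aggregation: output_name=(input_column, function).
summary = toy.groupby("author_ban_status").agg(
    n_videos=("video_view_count", "count"),
    view_median=("video_view_count", "median"),
    share_median=("video_share_count", "median"),
).reset_index()

print(summary)
```

The result has ordinary single-level columns, so a value can be read with `summary["view_median"]` rather than a tuple key.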


## Interpretation: Engagement Summary by Author Ban Status

**Key Observations**

- **Active authors** account for the majority of videos (15K+), but their engagement levels (views, likes, shares) are considerably lower than those of banned or under-review authors.  
- **Banned authors** have *extraordinarily high median values*, particularly in shares (14,468) and views (448,201). Videos from banned accounts therefore tend to be high-reach content, although the data does not show whether that reach came before the ban.  
- **Under-review authors** also show elevated engagement metrics compared to active users, which may indicate that their content tends to attract higher attention or controversy.  

---

**Interpretation**

There appears to be a **strong association between engagement level and moderation status.**  
Videos with unusually high engagement, especially in shares, come disproportionately from banned or under-review authors, although the data is observational and cannot show which came first.  
This could reflect:
- The spread of polarizing or misinformation-related content,  
- Algorithmic amplification of “claim” content, or  
- Enforcement policies reacting to viral reach.

Such insights are valuable for designing the future classification model, as they suggest that **author_ban_status** could act as a *predictive indicator* of content risk and virality patterns.


## Create New Columns for Engagement Rates

Raw engagement counts (likes, shares, comments) can be misleading because they depend on the total number of views.  
To get a clearer understanding of *how engaging* each video truly is, we calculate **engagement rate ratios**.

New columns:
- **`likes_per_view`** → number of likes ÷ number of views  
- **`comments_per_view`** → number of comments ÷ number of views  
- **`shares_per_view`** → number of shares ÷ number of views  

These ratios normalize engagement metrics and allow fairer comparison between videos with different audience sizes.


In [None]:
# Create new columns for engagement rates
data['likes_per_view'] = data['video_like_count'] / data['video_view_count']
data['comments_per_view'] = data['video_comment_count'] / data['video_view_count']
data['shares_per_view'] = data['video_share_count'] / data['video_view_count']

# Preview the new columns
data[['video_view_count', 'video_like_count', 'video_comment_count', 'video_share_count',
      'likes_per_view', 'comments_per_view', 'shares_per_view']].head()


Unnamed: 0,video_view_count,video_like_count,video_comment_count,video_share_count,likes_per_view,comments_per_view,shares_per_view
0,343296.0,19425.0,0.0,241.0,0.056584,0.0,0.000702
1,140877.0,77355.0,684.0,19034.0,0.549096,0.004855,0.135111
2,902185.0,97690.0,329.0,2858.0,0.108282,0.000365,0.003168
3,437506.0,239954.0,584.0,34812.0,0.548459,0.001335,0.079569
4,56167.0,34987.0,152.0,4110.0,0.62291,0.002706,0.073175


**Interpretation: Engagement Ratios (Per-View Metrics)**

The first five rows already show that per-view engagement differs considerably between videos.  
- **Like rates** range from about 5.7% to 62.3% in this preview, indicating that certain videos generate strong positive reactions.  
- **Comment rates** remain low (below 0.5%), which is typical for short-form content.  
- **Share rates** vary widely, from 0.07% to 13.5%; videos with rates above 10% likely have viral characteristics.

Overall, the ratios demonstrate that raw engagement counts can be misleading:  
the video with the fewest views in the preview (56,167) has the highest like rate (62.3%), while the most-viewed one (902,185) has a like rate of only 10.8%.
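
The preview covers only five rows, so a fuller picture comes from summarizing the whole ratio column. The sketch below also guards against zero view counts. In this dataset `describe()` reports a minimum of 20 views, so division by zero does not occur, but rows with missing counts still produce `NaN` ratios. The `toy` frame is an invented example:

```python
import numpy as np
import pandas as pd

# Invented rows, including a zero-view video and a row with missing counts.
toy = pd.DataFrame({
    "video_view_count": [1000.0, 2000.0, 0.0, np.nan],
    "video_like_count": [250.0, 100.0, 0.0, np.nan],
})

# Treat zero views as missing so the ratio is NaN rather than inf or 0/0.
safe_views = toy["video_view_count"].replace(0, np.nan)
toy["likes_per_view"] = toy["video_like_count"] / safe_views

# describe() ignores NaN and summarizes the remaining ratios.
summary = toy["likes_per_view"].describe()

print(summary)
```

On the real data, `data[['likes_per_view', 'comments_per_view', 'shares_per_view']].describe()` would give the full range and quartiles of each rate.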


## Analyze Engagement Ratios by Claim and Author Status

Next, we will analyze engagement trends across both `claim_status` and `author_ban_status`.  
By combining these categories, we can explore whether engagement behavior (likes, comments, shares) differs based on both:
- The type of content (claim vs. opinion)
- The author's moderation status (active, under review, banned)

We’ll use the `groupby()` function on both columns, and `agg()` to compute:
- the count of videos in each group
- the mean (average) engagement ratio
- the median engagement ratio (to reduce the effect of outliers)


In [None]:
# Group by both claim_status and author_ban_status, then aggregate engagement ratios
engagement_summary = data.groupby(['claim_status', 'author_ban_status']).agg({
    'likes_per_view': ['count', 'mean', 'median'],
    'comments_per_view': ['count', 'mean', 'median'],
    'shares_per_view': ['count', 'mean', 'median']
}).reset_index()

# Display the result
engagement_summary


| | claim_status | author_ban_status | likes_per_view: count | likes_per_view: mean | likes_per_view: median | comments_per_view: count | comments_per_view: mean | comments_per_view: median | shares_per_view: count | shares_per_view: mean | shares_per_view: median |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | claim | active | 6566 | 0.329542 | 0.326538 | 6566 | 0.001393 | 0.000776 | 6566 | 0.065456 | 0.049279 |
| 1 | claim | banned | 1439 | 0.345071 | 0.358909 | 1439 | 0.001377 | 0.000746 | 1439 | 0.067893 | 0.051606 |
| 2 | claim | under review | 1603 | 0.327997 | 0.320867 | 1603 | 0.001367 | 0.000789 | 1603 | 0.065733 | 0.049967 |
| 3 | opinion | active | 8817 | 0.219744 | 0.21833 | 8817 | 0.000517 | 0.000252 | 8817 | 0.043729 | 0.032405 |
| 4 | opinion | banned | 196 | 0.206868 | 0.198483 | 196 | 0.000434 | 0.000193 | 196 | 0.040531 | 0.030728 |
| 5 | opinion | under review | 463 | 0.226394 | 0.228051 | 463 | 0.000536 | 0.000293 | 463 | 0.044472 | 0.035027 |
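
To see which grouping variable matters more for the like rate, the medians can be arranged with claim status as rows and ban status as columns. The sketch below copies the six median `likes_per_view` values from the output above; the variable names are illustrative:

```python
import pandas as pd

# Median likes_per_view copied from the summary output above.
median_likes = pd.DataFrame(
    {
        "active": [0.326538, 0.218330],
        "banned": [0.358909, 0.198483],
        "under review": [0.320867, 0.228051],
    },
    index=pd.Index(["claim", "opinion"], name="claim_status"),
)

# Spread across ban statuses within each claim status.
within_spread = median_likes.max(axis=1) - median_likes.min(axis=1)

# Gap between claim and opinion videos for each ban status.
between_gap = median_likes.loc["claim"] - median_likes.loc["opinion"]

print(within_spread.round(3))
print(between_gap.round(3))
```

The claim-versus-opinion gap (about 0.09 to 0.16) is larger than the spread across ban statuses within a content type (about 0.03 to 0.04), so content type accounts for more of the variation in like rate than ban status does.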


## Executive Summary – TikTok Engagement Analysis

**Overview**

The dataset contains 19,382 rows, of which 19,084 carry a claim label: 50.3% claims and 49.7% opinions.  
This balanced distribution allows reliable comparison between content types.

**Findings**

- **Claim videos** show consistently higher engagement ratios across likes, comments, and shares.  
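- **Ban status** makes little difference once content type is held fixed: mean `likes_per_view` is 0.33 to 0.35 across active, under-review, and banned authors for claim videos, and 0.21 to 0.23 for opinion videos. The higher raw counts seen for banned and under-review authors mostly reflect that their videos are predominantly claims.  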
- Mean `likes_per_view` ≈ 0.33 for claim videos vs. 0.22 for opinions.  
- Mean `shares_per_view` ≈ 0.065 for claim videos vs. 0.044 for opinions.

**Interpretation**

These patterns indicate that:
- Claim-based content tends to elicit stronger viewer reactions.
- High raw engagement counts coincide with moderation status (banned or under review), while per-view rates differ little by ban status within a content type.
- Engagement metrics can thus serve as **predictive indicators** for future classification models aiming to identify potentially risky or viral content.

**Next Steps**

- Explore additional metadata (e.g., posting time, hashtags, caption sentiment) to improve feature richness.  
- Consider further hypothesis testing and model training to validate whether engagement behavior predicts claim classification.
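
As one possible form of the hypothesis testing mentioned above, a permutation test can check whether a difference in mean like rate between two groups is larger than chance would produce. The sketch below runs on simulated rates, not on the dataset: the group means of 0.33 and 0.22 echo the summary, while the spread, sample size, and number of permutations are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated like rates; only the group means echo the summary above.
claim_rates = rng.normal(loc=0.33, scale=0.10, size=200)
opinion_rates = rng.normal(loc=0.22, scale=0.10, size=200)

# Observed difference in means between the two groups.
observed = claim_rates.mean() - opinion_rates.mean()

# Shuffle the pooled values and recompute the difference many times.
pooled = np.concatenate([claim_rates, opinion_rates])
n_claim = len(claim_rates)
n_perm = 2000
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:n_claim].mean() - shuffled[n_claim:].mean()

# Two-sided p-value with a +1 correction so it is never exactly zero.
p_value = (np.sum(np.abs(perm_diffs) >= abs(observed)) + 1) / (n_perm + 1)

print("observed difference:", round(observed, 3))
print("permutation p-value:", round(p_value, 4))
```

On the real data the two arrays would be the `likes_per_view` values of claim and opinion videos with missing values removed. A permutation test is used here because it makes no normality assumption; other tests may be equally appropriate.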
