# **TikTok Project**
**Course 4 - The Power of Statistics**

You are a data professional at TikTok. The current project is reaching its midpoint; a project proposal, Python coding work, and exploratory data analysis have all been completed.

The team has reviewed the results of the exploratory data analysis and the previous executive summary the team prepared. You received an email from Orion Rainier, Data Scientist at TikTok, with your next assignment: determine and conduct the necessary hypothesis tests and statistical analysis for the TikTok classification project.

A notebook was structured and prepared to help you in this project. Please complete the following questions.


# **Course 4 End-of-course project: Data exploration and hypothesis testing**

In this activity, you will explore the data provided and conduct hypothesis testing.
<br/>

**The purpose** of this project is to demonstrate knowledge of how to prepare, create, and analyze hypothesis tests.

**The goal** is to apply descriptive and inferential statistics, probability distributions, and hypothesis testing in Python.
<br/>

*This activity has three parts:*

**Part 1:** Imports and data loading
* What data packages will be necessary for hypothesis testing?

**Part 2:** Conduct hypothesis testing
* How will descriptive statistics help you analyze your data?

* How will you formulate your null hypothesis and alternative hypothesis?

**Part 3:** Communicate insights with stakeholders

* What key business insight(s) emerge from your hypothesis test?

* What business recommendations do you propose based on your results?

<br/>

Follow the instructions and answer the questions below to complete the activity. Then, complete an executive summary using the questions listed on the PACE Strategy Document.

Be sure to complete this activity before moving on. The next course item will provide you with a completed exemplar to compare to your own work.



# **Data exploration and hypothesis testing**

<img src="images/Pace.png" width="100" height="100" align=left>

# **PACE stages**

Throughout these project notebooks, you'll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.

<img src="images/Plan.png" width="100" height="100" align=left>


## **PACE: Plan**

Consider the questions in your PACE Strategy Document and those below to craft your response.

1. What is your research question for this data project? Later on, you will need to formulate the null and alternative hypotheses as the first step of your hypothesis test. Consider your research question now, at the start of this task.

**Research Question:**
Do videos posted by **verified** accounts and videos posted by **non-verified** accounts differ in mean view count (`video_view_count`)? Specifically, we want to test whether the difference in mean views observed in the sample is statistically significant.

This question will guide the formulation of our null and alternative hypotheses for the hypothesis testing phase of the project.

*Complete the following steps to perform statistical analysis of your data:*

### **Task 1. Imports and Data Loading**

Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.

<details>
  <summary><h4><strong>Hint:</strong></h4></summary>

Be sure to import `pandas`, `numpy`, `matplotlib.pyplot`, `seaborn`, and `scipy`.

</details>

In [1]:
# Importing necessary packages and libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Check if the necessary libraries are imported successfully
print("Libraries imported successfully!")

Libraries imported successfully!


Load the dataset.

**Note:** As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [2]:
# Load dataset into dataframe
data = pd.read_csv("tiktok_dataset.csv")

<img src="images/Analyze.png" width="100" height="100" align=left>

<img src="images/Construct.png" width="100" height="100" align=left>

## **PACE: Analyze and Construct**

Consider the questions in your PACE Strategy Document and those below to craft your response:
1. Data professionals use descriptive statistics for Exploratory Data Analysis. How can computing descriptive statistics help you learn more about your data in this stage of your analysis?


**How can computing descriptive statistics help you learn more about your data in this stage of your analysis?**

Computing descriptive statistics is essential in the analysis phase as it provides a summary of the central tendencies, dispersion, and shape of the data distribution, helping us to understand its structure. It allows for the identification of patterns, trends, and outliers, which are crucial for forming hypotheses and making decisions about the next steps in the analysis. Descriptive statistics such as mean, median, mode, standard deviation, and quartiles will give us insights into the spread and variability of the data, helping to determine whether the data is normally distributed or skewed. Additionally, understanding the distribution and central tendency of key variables is vital when deciding on the most appropriate statistical tests for hypothesis testing or for identifying relationships between variables.

By calculating and visualizing these statistics, we can determine whether there are potential outliers or discrepancies in the data and assess its quality. This step will help us define the boundaries for further analysis and refine our hypotheses, contributing to more accurate and reliable conclusions when conducting hypothesis tests.
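The points above can be illustrated with a minimal sketch. It uses a synthetic, right-skewed series as a stand-in for a count column such as `video_view_count`; the data, the variable names, and the 1.5 × IQR outlier rule are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for a count column such as `video_view_count`.
# These values are illustrative only; they are not drawn from the TikTok dataset.
rng = np.random.default_rng(seed=0)
views = pd.Series(rng.lognormal(mean=9.0, sigma=1.5, size=5_000), name="views")

summary = views.describe()   # count, mean, std, min, quartiles, max
skewness = views.skew()      # a value above 0 indicates a long right tail

# For right-skewed data the mean sits above the median.
mean_above_median = summary["mean"] > summary["50%"]

# Flag potential high outliers with the 1.5 * IQR rule (one common convention).
iqr = summary["75%"] - summary["25%"]
upper_fence = summary["75%"] + 1.5 * iqr
n_high_outliers = int((views > upper_fence).sum())

print(summary.round(1))
print(f"skewness: {skewness:.2f}, mean > median: {mean_above_median}")
print(f"values above upper fence: {n_high_outliers}")
```

A large gap between mean and median, together with positive skewness, is a signal that a comparison of means should be read alongside the medians and group sizes.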

### **Task 2. Data exploration**

Use descriptive statistics to conduct Exploratory Data Analysis (EDA).



<details>
  <summary><h4><strong>Hint:</strong></h4></summary>

Refer back to *Self Review Descriptive Statistics* for this step-by-step process.

</details>

Inspect the first five rows of the dataframe.

In [3]:
data.head()

|   | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |


In [4]:
# Generate a table of descriptive statistics about the data
data.describe()

|   | # | video_id | video_duration_sec | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|
| count | 19382.0 | 19382.0 | 19382.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 |
| mean | 9691.5 | 5627454000.0 | 32.421732 | 254708.558688 | 84304.63603 | 16735.248323 | 1049.429627 | 349.312146 |
| std | 5595.245794 | 2536440000.0 | 16.229967 | 322893.280814 | 133420.546814 | 32036.17435 | 2004.299894 | 799.638865 |
| min | 1.0 | 1234959000.0 | 5.0 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4846.25 | 3430417000.0 | 18.0 | 4942.5 | 810.75 | 115.0 | 7.0 | 1.0 |
| 50% | 9691.5 | 5618664000.0 | 32.0 | 9954.5 | 3403.5 | 717.0 | 46.0 | 9.0 |
| 75% | 14536.75 | 7843960000.0 | 47.0 | 504327.0 | 125020.0 | 18222.0 | 1156.25 | 292.0 |
| max | 19382.0 | 9999873000.0 | 60.0 | 999817.0 | 657830.0 | 256130.0 | 14994.0 | 9599.0 |


Check for and handle missing values.

In [5]:
# Check for missing values
data.isnull().sum()

#                             0
claim_status                298
video_id                      0
video_duration_sec            0
video_transcription_text    298
verified_status               0
author_ban_status             0
video_view_count            298
video_like_count            298
video_share_count           298
video_download_count        298
video_comment_count         298
dtype: int64

In [6]:
# Drop rows with missing values
data_cleaned = data.dropna()

In [7]:
# Display first few rows after handling missing values
data_cleaned.head()

|   | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |
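The `isnull().sum()` output reports the same count of 298 missing values in seven columns, which suggests (but does not prove) that the gaps fall in the same rows. A minimal sketch of how that could be checked before dropping rows is shown here. The small `frame` dataframe and its values are hypothetical and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two rows are missing values across several columns at once.
frame = pd.DataFrame({
    "video_id": [1, 2, 3, 4, 5],
    "claim_status": ["claim", None, "opinion", None, "claim"],
    "video_view_count": [100.0, np.nan, 250.0, np.nan, 75.0],
    "video_like_count": [10.0, np.nan, 30.0, np.nan, 5.0],
})

# Rows with at least one missing value, and the largest per-column missing count.
rows_with_any_missing = int(frame.isna().any(axis=1).sum())
max_missing_per_column = int(frame.isna().sum().max())

# If the two numbers match, every gap lies within the same set of rows,
# so dropna() removes no more rows than the worst single column requires.
same_rows = rows_with_any_missing == max_missing_per_column

cleaned = frame.dropna()
print(rows_with_any_missing, max_missing_per_column, same_rows, len(cleaned))
```

Applied to the real data, the same comparison would indicate whether `dropna()` removes exactly 298 rows or more.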


You are interested in the relationship between `verified_status` and `video_view_count`. One approach is to examine the mean value of `video_view_count` for each group of `verified_status` in the sample data.

In [8]:
# Compute the mean `video_view_count` for each group in `verified_status`
mean_video_views_by_verified = data_cleaned.groupby('verified_status')['video_view_count'].mean()
mean_video_views_by_verified

verified_status
not verified    265663.785339
verified         91439.164167
Name: video_view_count, dtype: float64
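Because view counts are strongly right-skewed (in the `describe()` output the mean is far above the median), it can help to look at group sizes and medians next to the means. The sketch below shows one way to do that with `groupby(...).agg(...)`. The `sample` dataframe is a tiny, made-up example, not the TikTok data.

```python
import pandas as pd

# Tiny hypothetical sample; the values are made up for illustration.
sample = pd.DataFrame({
    "verified_status": [
        "verified", "verified", "verified",
        "not verified", "not verified", "not verified", "not verified",
    ],
    "video_view_count": [1_200, 3_400, 90_000, 800, 5_000, 7_500, 640_000],
})

# Count, mean, and median of views for each verification group.
group_summary = (
    sample.groupby("verified_status")["video_view_count"]
    .agg(["count", "mean", "median"])
)
print(group_summary)
```

In this toy example a single very large value pulls the "not verified" mean far above its median, which is the kind of pattern worth checking before interpreting a difference in means.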

### **Task 3. Hypothesis testing**

Before you conduct your hypothesis test, consider the following questions where applicable to complete your code response:

1. Recall the difference between the null hypothesis and the alternative hypotheses. What are your hypotheses for this data project?

Null Hypothesis (H₀):
There is no difference in mean `video_view_count` between videos from verified accounts and videos from non-verified accounts. Any difference observed in the sample is due to chance or sampling variability.

Alternative Hypothesis (H₁):
The mean `video_view_count` differs between videos from verified accounts and videos from non-verified accounts. Any difference observed in the sample reflects an actual difference in the population means.



Your goal in this step is to conduct a two-sample t-test. Recall the steps for conducting a hypothesis test:


1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis



Null Hypothesis (H₀):
There is no difference in mean video view counts between verified and non-verified users.

Alternative Hypothesis (H₁):
There is a difference in mean video view counts between verified and non-verified users.

You choose 5% as the significance level and proceed with a two-sample t-test.
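For reference, the hypotheses and the test statistic can be written formally. The notation is introduced here for illustration: $\mu$ is a population mean, $\bar{x}$ a sample mean, $s^2$ a sample variance, and $n$ a group size, with subscripts $nv$ for non-verified and $v$ for verified. The statistic shown is the unequal-variance (Welch) form, which corresponds to calling `stats.ttest_ind` with `equal_var=False`.

```latex
H_0:\ \mu_{nv} = \mu_{v}
\qquad
H_1:\ \mu_{nv} \neq \mu_{v}
\qquad
\alpha = 0.05

t = \frac{\bar{x}_{nv} - \bar{x}_{v}}
         {\sqrt{\dfrac{s_{nv}^{2}}{n_{nv}} + \dfrac{s_{v}^{2}}{n_{v}}}}
\qquad
\nu \approx
\frac{\left(\dfrac{s_{nv}^{2}}{n_{nv}} + \dfrac{s_{v}^{2}}{n_{v}}\right)^{2}}
     {\dfrac{\left(s_{nv}^{2}/n_{nv}\right)^{2}}{n_{nv}-1}
      + \dfrac{\left(s_{v}^{2}/n_{v}\right)^{2}}{n_{v}-1}}
```

The null hypothesis is rejected when the two-sided p-value for $t$ with $\nu$ degrees of freedom falls below $\alpha$.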

In [9]:
print(data["video_view_count"].isna().sum())

298


In [10]:
print("Verified count:", data[data["verified_status"] == "verified"].shape[0])
print("Not verified count:", data[data["verified_status"] == "not verified"].shape[0])

Verified count: 1240
Not verified count: 18142


In [11]:
print(data["verified_status"].unique())

['not verified' 'verified']


In [12]:
print(data["video_view_count"].dtype)

float64


In [13]:
print("Unique values in verified_status column:")
print(data["verified_status"].unique())

print("\nNumber of verified users:")
print(data[data["verified_status"] == "verified"].shape[0])

print("\nNumber of not verified users:")
print(data[data["verified_status"] == "not verified"].shape[0])

Unique values in verified_status column:
['not verified' 'verified']

Number of verified users:
1240

Number of not verified users:
18142


In [14]:
data = data.dropna(subset=["video_view_count"])

In [15]:
not_verified = data[data["verified_status"] == "not verified"]["video_view_count"]
verified = data[data["verified_status"] == "verified"]["video_view_count"]

t_stat, p_value = stats.ttest_ind(a=not_verified, b=verified, equal_var=False)
print("T-statistic:", t_stat)
print("P-value:", p_value)

T-statistic: 25.499441780633777
P-value: 2.6088823687177823e-120
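To make clear what `equal_var=False` computes, the sketch below reproduces the Welch t-statistic and two-sided p-value by hand and compares them with `stats.ttest_ind`. The two groups are synthetic (unequal sizes and variances, generated with a fixed seed) and the names `group_a` and `group_b` are placeholders, so the numbers are unrelated to the TikTok results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
# Synthetic groups with unequal sizes and variances (not the TikTok data).
group_a = rng.normal(loc=100.0, scale=30.0, size=400)
group_b = rng.normal(loc=90.0, scale=10.0, size=60)

# Squared standard error of each group mean, using the sample variance (ddof=1).
se2_a = group_a.var(ddof=1) / group_a.size
se2_b = group_b.var(ddof=1) / group_b.size

# Welch t-statistic: difference in means over the combined standard error.
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(se2_a + se2_b)

# Welch-Satterthwaite approximation for the degrees of freedom.
dof = (se2_a + se2_b) ** 2 / (
    se2_a ** 2 / (group_a.size - 1) + se2_b ** 2 / (group_b.size - 1)
)
p_manual = 2 * stats.t.sf(abs(t_manual), dof)  # two-sided p-value

t_scipy, p_scipy = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)

alpha = 0.05
decision = "reject H0" if p_scipy < alpha else "fail to reject H0"
print(f"manual: t={t_manual:.4f}, p={p_manual:.3g}")
print(f"scipy:  t={t_scipy:.4f}, p={p_scipy:.3g} -> {decision}")
```

The sign of the statistic follows the argument order: a positive value means the sample mean of `a` is larger than that of `b`.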


**Question:** Based on the p-value you got above, do you reject or fail to reject the null hypothesis?


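Based on the p-value of approximately 2.61 × 10⁻¹²⁰, which is far below the significance level of 0.05, we reject the null hypothesis.
This provides strong statistical evidence of a difference in mean video view counts between verified and non-verified TikTok accounts. The direction matters for interpretation: in the sample, the non-verified group has the higher mean (about 265,664 views versus about 91,439 for the verified group), which is why the t-statistic is positive when `not_verified` is passed as `a`.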

<img src="images/Execute.png" width="100" height="100" align=left>

## **PACE: Execute**

Consider the questions in your PACE Strategy Document to reflect on the Execute stage.

## **Step 4: Communicate insights with stakeholders**

*Ask yourself the following questions:*

1. What business insight(s) can you draw from the result of your hypothesis test?

**Insight from the hypothesis test**

The analysis reveals a statistically significant difference in mean video view counts between verified and non-verified TikTok accounts. In this sample the non-verified group has the higher mean: about 265,664 views per video, compared with about 91,439 for verified accounts.

**Business insights and implications**

**Verified status is not associated with higher reach in this sample**

* Videos from verified accounts receive fewer views on average than videos from non-verified accounts.
* The test shows an association only. It does not show that verification lowers views, and other differences between the two groups (for example, the kind of content each group posts) may explain the gap.

**Verification is not supported as a growth strategy**

* The result gives no support for encouraging creators, influencers, or brand accounts to pursue verification as a way to raise view counts.
* Partnerships or outreach to high-potential creators would be better grounded in evidence about what actually drives views than in verification status.

**Use verified status cautiously in marketing strategy**

* When selecting creators for campaigns, verified status alone should not be treated as an indicator of high view counts.
* Brands targeting higher reach may prefer to evaluate a creator's actual view history rather than verification status.

**Content strategy adjustments**

* To understand what drives visibility, analyze the strategies behind high-view videos (posting times, formats, hashtags, etc.) for both verified and non-verified accounts.
* TikTok can offer guidance or tools that help creators grow their reach regardless of verification status.

**Suggested next steps**

* **Explore causal factors:** Does verification itself affect view counts, or do other differences between verified and non-verified accounts explain the gap? Further analysis (e.g., regression, time-series) could explore this.
* **Segment by niche or content type:** Does the difference between verified and non-verified accounts vary across content categories (e.g., beauty vs. gaming)?
* **Evaluate engagement metrics:** Extend the analysis to likes, shares, and comments for a fuller performance picture.
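Extending the test to other engagement metrics could follow the same pattern as the view-count test. The sketch below loops a Welch t-test over several metric columns. The `demo` dataframe is randomly generated for illustration; only its column names mirror the dataset, and the results say nothing about the real data.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(seed=1)
n = 600
# Synthetic stand-in for the cleaned dataframe; column names mirror the dataset,
# values are randomly generated for illustration.
demo = pd.DataFrame({
    "verified_status": rng.choice(["verified", "not verified"], size=n, p=[0.2, 0.8]),
    "video_like_count": rng.lognormal(8.0, 1.2, size=n),
    "video_share_count": rng.lognormal(6.0, 1.2, size=n),
    "video_comment_count": rng.lognormal(4.0, 1.2, size=n),
})

metrics = ["video_like_count", "video_share_count", "video_comment_count"]
is_verified = demo["verified_status"] == "verified"

# Run one Welch t-test per metric and collect the results in a table.
rows = []
for metric in metrics:
    t_stat, p_value = stats.ttest_ind(
        a=demo.loc[~is_verified, metric],
        b=demo.loc[is_verified, metric],
        equal_var=False,
    )
    rows.append({"metric": metric, "t_stat": t_stat, "p_value": p_value})

results = pd.DataFrame(rows).set_index("metric")
print(results)
```

Running several tests at the same significance level raises the chance of at least one false positive, so a multiple-comparison adjustment (for example, a Bonferroni correction) may be worth considering when interpreting the table.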