# T-Test with Python · Short Videos App

Completed by [Anton Starshev](http://linkedin.com/in/starshev) on 18/04/2024

### Context

According to the fictional project scenario, I am a member of TikTok's data analytics team that has completed the first three milestones of the claims classification project. 

Project management officers inform the data team about a new request: to determine whether there is a statistically significant difference in the number of views for TikTok videos posted by verified accounts versus unverified accounts.

### Data

This project uses a dataset called **tiktok_dataset.csv**. It contains synthetic data created for this project in partnership with TikTok.

### Execution

The project is divided into four key phases that are carried out step by step:

1. Importing necessary Python packages and loading the dataset
2. Wrangling and exploring the project data
3. Implementing a hypothesis test
4. Formulating business insights and recommendations

### 1 · Data Loading

Imported packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.

In [1]:
import pandas as pd
from scipy import stats

Loaded the scenario dataset into a DataFrame.

In [5]:
df = pd.read_csv("tiktok_dataset.csv", index_col = 0)

### 2 · Data Wrangling and Exploration

Previewed the loaded data.

In [20]:
df.head(3)

| # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |


Checked the data size.

In [7]:
df.shape

(19382, 11)

Verified the data types and names of columns.

In [8]:
df.dtypes

claim_status                 object
video_id                      int64
video_duration_sec            int64
video_transcription_text     object
verified_status              object
author_ban_status            object
video_view_count            float64
video_like_count            float64
video_share_count           float64
video_download_count        float64
video_comment_count         float64
dtype: object

Used descriptive statistics to conduct Exploratory Data Analysis (EDA) on the video view counts.

In [12]:
df[['video_view_count']].describe(include = 'all')

| | video_view_count |
|---|---|
| count | 19084.0 |
| mean | 254708.558688 |
| std | 322893.280814 |
| min | 20.0 |
| 25% | 4942.5 |
| 50% | 9954.5 |
| 75% | 504327.0 |
| max | 999817.0 |


Checked for and handled missing values.

In [14]:
df.isnull().sum()

claim_status                298
video_id                      0
video_duration_sec            0
video_transcription_text    298
verified_status               0
author_ban_status             0
video_view_count            298
video_like_count            298
video_share_count           298
video_download_count        298
video_comment_count         298
dtype: int64

In [21]:
df.dropna(axis = 0, inplace = True)
df.isnull().sum()

claim_status                0
video_id                    0
video_duration_sec          0
video_transcription_text    0
verified_status             0
author_ban_status           0
video_view_count            0
video_like_count            0
video_share_count           0
video_download_count        0
video_comment_count         0
dtype: int64
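
The test in this project only uses `verified_status` and `video_view_count`, so one alternative to dropping every row with any missing value is to drop rows only when those two columns are missing. The sketch below illustrates this with `DataFrame.dropna(subset=...)` on a small made-up frame; the `toy` data and the variable names are assumptions for illustration, not part of the project dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real dataset (illustration only).
toy = pd.DataFrame({
    "verified_status": ["verified", "not verified", "not verified", "verified"],
    "video_view_count": [120.0, np.nan, 45.0, 300.0],
    "video_like_count": [10.0, 5.0, np.nan, 20.0],
})

rows_before = len(toy)

# Drop a row only if a column needed for the test is missing.
# This returns a new DataFrame instead of modifying `toy` in place.
cleaned = toy.dropna(subset=["verified_status", "video_view_count"])

rows_dropped = rows_before - len(cleaned)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```

Recording how many rows were removed makes it easier to judge whether the deletion could have affected the comparison between groups.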

Checked for duplicated rows.

In [16]:
df.duplicated().sum()

0

Since I was interested in the relationship between account status and video view count, one approach was to examine the mean value of video view count for each group of verified or not verified accounts in the sample data.

In [17]:
df.groupby('verified_status')[['video_view_count']].mean()

| verified_status | video_view_count |
|---|---|
| not verified | 265663.785339 |
| verified | 91439.164167 |


**Observation**: First, the output confirms that the data contain only two account status groups.

Second, in the sample, videos posted by verified accounts receive far fewer views on average (about 91,439) than those posted by unverified accounts (about 265,664). However, I still needed to show that this gap was not simply a result of sampling variability. The next step was therefore to assess its statistical significance with a hypothesis test.
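
The descriptive statistics above show a median view count (9,954.5) far below the mean (about 254,709), so the distribution is strongly skewed. In that situation it can help to look at group sizes and medians next to the means. The sketch below does this on a small made-up frame; the `toy` data is an assumption for illustration and its numbers are not results from the project.

```python
import pandas as pd

# Made-up rows standing in for the real dataset (illustration only).
toy = pd.DataFrame({
    "verified_status": ["verified", "verified",
                        "not verified", "not verified", "not verified"],
    "video_view_count": [100.0, 300.0, 50.0, 150.0, 1000.0],
})

# Group size, mean, median, and sample standard deviation per account status.
summary = (
    toy.groupby("verified_status")["video_view_count"]
       .agg(["count", "mean", "median", "std"])
)
print(summary)
```

On the real data the same call would be applied to `df` instead of `toy`.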

### 3 · Hypothesis Test

Stated the null hypothesis and the alternative hypothesis:

**H<sub>0</sub>**: There is no difference in the mean number of views between TikTok videos posted by verified accounts and those posted by unverified accounts (any difference observed in the sample data is attributable to chance or sampling variability).

**H<sub>1</sub>**: There is a difference in the mean number of views between TikTok videos posted by verified accounts and those posted by unverified accounts (the difference observed in the sample data reflects an actual difference in the corresponding population means).

Assigned a **5% significance level** to the hypothesis test.

Chose the type of hypothesis test: a **two-sample, two-tailed t-test**.

Filtered the data into two groups based on the account status: verified or not verified.

In [18]:
verified = df.loc[df['verified_status'] == 'verified', 'video_view_count']

not_verified = df.loc[df['verified_status'] == 'not verified', 'video_view_count']

Conducted the hypothesis test using SciPy Stats.

In [19]:
stats.ttest_ind(a = verified, b = not_verified, equal_var = False,
                alternative = 'two-sided')

TtestResult(statistic=-25.499441780633777, pvalue=2.6088823687177823e-120, df=1571.163074387424)
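
Because the call above sets `equal_var = False`, SciPy performs Welch's version of the t-test, which does not assume that the two groups have equal variances. As a reminder, with $\bar{x}_i$, $s_i^2$ and $n_i$ denoting the sample mean, sample variance and size of group $i$ (this notation is mine, not the notebook's), the test statistic and the Welch-Satterthwaite approximation of the degrees of freedom are:

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
$$

This approximation is why the reported `df` is a non-integer value.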

**Test Result**: Since the p-value (about 2.6e-120) is far below the 5% significance level, I rejected the null hypothesis.
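
As a sanity check on what `stats.ttest_ind` computes with `equal_var = False`, the statistic and p-value can be recomputed by hand and compared with SciPy's output. The sketch below does this on synthetic samples; the helper `welch_t_test`, the sample parameters and the variable names are assumptions made for illustration, and the numbers it prints are unrelated to the project data.

```python
import numpy as np
from scipy import stats

def welch_t_test(a, b):
    """Return (t, df, two-sided p) for Welch's unequal-variance t-test."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error contributed by each group.
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    p = 2 * stats.t.sf(abs(t), df)
    return t, df, p

# Synthetic samples with different sizes and spreads (illustration only).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=200)
group_b = rng.normal(loc=11.0, scale=5.0, size=1000)

t_manual, df_manual, p_manual = welch_t_test(group_a, group_b)
result = stats.ttest_ind(group_a, group_b, equal_var=False)

# Decision rule at the 5% significance level.
alpha = 0.05
reject_null = result.pvalue < alpha
print(t_manual, df_manual, p_manual, reject_null)
```

The manual and SciPy values should agree up to floating-point error, which confirms that the decision rule is applied to a Welch test rather than a pooled-variance test.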

### 4 · Insight and Recommendation

**Business Insight**: Based on the conducted test, the key business insight is that there is a statistically significant difference in the average number of views of videos created by verified versus unverified accounts. Specifically, videos from unverified accounts average roughly 2.9 times as many views in the sample (about 265,664 versus 91,439).
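
A p-value alone does not say how large the difference is, so a confidence interval for the difference in means can be a useful complement when communicating the result. The sketch below builds a Welch-style interval by hand on synthetic, skewed samples; the helper `welch_confidence_interval` and the sample parameters are assumptions for illustration, and this calculation was not part of the original analysis.

```python
import numpy as np
from scipy import stats

def welch_confidence_interval(a, b, confidence=0.95):
    """Confidence interval for mean(a) - mean(b) without assuming equal variances."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    diff = a.mean() - b.mean()
    se = np.sqrt(va + vb)
    # Welch-Satterthwaite degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df)
    return diff - t_crit * se, diff + t_crit * se

# Synthetic skewed samples (illustration only, not the project data).
rng = np.random.default_rng(1)
sample_a = rng.exponential(scale=100.0, size=500)
sample_b = rng.exponential(scale=250.0, size=2000)

low, high = welch_confidence_interval(sample_a, sample_b)
print(low, high)
```

On the project data the inputs would be the `verified` and `not_verified` series created earlier.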

**Communication and Recommendation**: *The analysis points to potentially fundamental behavioral differences between the two account categories, and the underlying cause is worth investigating. For instance, do unverified accounts tend to share more engaging videos, or are they associated with spam bots?*

### Acknowledgment

I would like to express my gratitude to Google and Coursera for supporting the educational process and for the opportunity to refine and showcase skills acquired during the courses by completing real-life scenario portfolio projects such as this one.

### Reference

This is an end-of-course workplace scenario project *«TikTok, created in partnership with the short-form video hosting company»* proposed within the syllabus of *Google Advanced Data Analytics Professional Certificate* on Coursera.