# **TikTok Project**
**Stage 4 - The Power of Statistics**

In this stage, we determine and conduct the hypothesis tests and statistical analysis needed for the TikTok classification project.

### **Task 1. Imports and Data Loading**

In [1]:
# Import packages for data manipulation
import pandas as pd

# Import packages for data visualization
import seaborn as sns 
import matplotlib.pyplot as plt

# Import packages for statistical analysis/hypothesis testing
from scipy import stats

In [2]:
# Load dataset into dataframe
data = pd.read_csv("../dataset/tiktok_dataset.csv")

### **Task 2. Data exploration**
We use descriptive statistics to conduct Exploratory Data Analysis (EDA).

In [3]:
# Display first few rows
data.head()

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0


In [4]:
# Generate a table of descriptive statistics about the data
data.describe(include='all')

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
count,19382.0,19084,19382.0,19382.0,19084,19382,19382,19084.0,19084.0,19084.0,19084.0,19084.0
unique,,2,,,19012,2,3,,,,,
top,,claim,,,a friend read in the media a claim that badmi...,not verified,active,,,,,
freq,,9608,,,2,18142,15663,,,,,
mean,9691.5,,5627454000.0,32.421732,,,,254708.558688,84304.63603,16735.248323,1049.429627,349.312146
std,5595.245794,,2536440000.0,16.229967,,,,322893.280814,133420.546814,32036.17435,2004.299894,799.638865
min,1.0,,1234959000.0,5.0,,,,20.0,0.0,0.0,0.0,0.0
25%,4846.25,,3430417000.0,18.0,,,,4942.5,810.75,115.0,7.0,1.0
50%,9691.5,,5618664000.0,32.0,,,,9954.5,3403.5,717.0,46.0,9.0
75%,14536.75,,7843960000.0,47.0,,,,504327.0,125020.0,18222.0,1156.25,292.0


Let's check for and handle missing values.

In [5]:
# Check for missing values
null_mask = data.isnull()
null_mask = null_mask.any(axis=1)
data[null_mask]

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
19084,19085,,4380513697,39,,not verified,active,,,,,
19085,19086,,8352130892,60,,not verified,active,,,,,
19086,19087,,4443076562,25,,not verified,active,,,,,
19087,19088,,8328300333,7,,not verified,active,,,,,
19088,19089,,3968729520,8,,not verified,active,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
19377,19378,,7578226840,21,,not verified,active,,,,,
19378,19379,,6079236179,53,,not verified,active,,,,,
19379,19380,,2565539685,10,,verified,under review,,,,,
19380,19381,,2969178540,24,,not verified,active,,,,,


In [6]:
# Drop rows with missing values
data = data.dropna(axis=0)

In [7]:
# Display first few rows after handling missing values
data.head()

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0


In [8]:
data.shape

(19084, 12)
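
The drop above reduces the data from 19382 to 19084 rows, that is, 298 rows removed. As a minimal sketch of how that could be verified, the example below counts missing values per column and the number of rows dropped. It runs on a small made-up frame (`toy` and its values are assumptions for illustration, not the TikTok data).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the TikTok data (values are made up)
toy = pd.DataFrame({
    "claim_status": ["claim", "opinion", None, "claim"],
    "video_view_count": [100.0, 250.0, np.nan, 75.0],
    "verified_status": ["verified", "not verified", "not verified", "verified"],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()

# Compare row counts before and after dropping rows with any missing value
rows_before = len(toy)
toy_clean = toy.dropna(axis=0)
rows_dropped = rows_before - len(toy_clean)

print(null_counts)
print(f"Rows dropped: {rows_dropped}")
```

Running the same two checks on `data` before the `dropna` call would show which columns carry the missing values and confirm that no unexpected rows are lost.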

Based on the earlier stages, it is worth exploring the relationship between `verified_status` and `video_view_count`. One approach is to examine the mean value of `video_view_count` for each group of `verified_status` in the sample data.

In [13]:
# First, let's compute the mean `video_view_count` for each group in `verified_status`

# Two boolean masks to separate the two kinds of accounts
not_verified_mask = data['verified_status']=='not verified'
verified_mask = data['verified_status']=='verified'

# Compute the mean view count for each group
not_verified_mean = data[not_verified_mask]['video_view_count'].mean()
print(f'Not Verified video view count mean: {not_verified_mean: .2f}')

verified_mean = data[verified_mask]['video_view_count'].mean()
print(f'Verified video view count mean: {verified_mean: .2f}')

Not Verified video view count mean:  265663.79
Verified video view count mean:  91439.16
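
The same per-group comparison could also be written with `groupby`, which avoids building one mask per group and makes it easy to add the group size and the median next to the mean. This is only a sketch on a made-up frame (`toy` and its values are assumptions); the column names match the dataset.

```python
import pandas as pd

# Toy frame with the two columns of interest (values are made up)
toy = pd.DataFrame({
    "verified_status": ["verified", "not verified", "not verified", "verified"],
    "video_view_count": [100.0, 300.0, 500.0, 200.0],
})

# Group size, mean, and median of the view count for each verification status
group_stats = (
    toy.groupby("verified_status")["video_view_count"]
    .agg(["count", "mean", "median"])
)
print(group_stats)
```

The median is worth a look here because the view counts are heavily right-skewed (in the descriptive statistics above, the overall mean is far larger than the 50% quantile).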


### **Task 3. Hypothesis testing**
Before conducting the hypothesis test, let's recall what the null and alternative hypotheses are.

- The null hypothesis represents the "no effect" scenario. It states that there is no difference between groups, or essentially, nothing interesting is happening. We often try to disprove the null hypothesis.
- The alternative hypothesis is your actual prediction or claim. It states the opposite of the null hypothesis and suggests that there is a difference, or some effect you're interested in.

The goal in this step is to conduct a two-sample t-test. Let's recall the steps for conducting a hypothesis test:

1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis
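
The four steps above can be sketched as a small helper. The function name `two_sample_test` and the synthetic samples are hypothetical and only illustrate the flow; the test itself uses `scipy.stats.ttest_ind` with `equal_var=False`.

```python
import numpy as np
from scipy import stats

def two_sample_test(sample_a, sample_b, significance_level=0.05):
    """Hypothetical helper: run a two-sample t-test and apply the decision rule.

    H0: the two population means are equal.
    Ha: the two population means differ.
    """
    # Step 3: find the p-value (equal_var=False does not assume equal variances)
    result = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False)
    # Step 4: reject H0 only if the p-value is below the significance level
    reject_null = result.pvalue < significance_level
    return result.statistic, result.pvalue, reject_null

# Synthetic samples with clearly different means (made-up data)
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=100.0, scale=10.0, size=200)
group_b = rng.normal(loc=130.0, scale=10.0, size=200)

statistic, p_value, reject_null = two_sample_test(group_a, group_b)
print(f"t = {statistic:.2f}, p = {p_value:.3g}, reject H0: {reject_null}")
```

Keeping the significance level as an explicit argument makes the decision in step 4 a direct comparison rather than a judgment made by eye.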

Now, let's set the null and alternative hypotheses. We are interested in the relationship between video views and verified status. In the sample, unverified accounts have a higher mean view count than verified accounts, so the hypotheses are stated as follows:

- H0: There is no difference in the mean number of views between verified and unverified accounts (any difference observed in the sample is due to chance or sampling variability).
- Ha: There is a difference in the mean number of views between verified and unverified accounts (the difference observed in the sample reflects an actual difference in the population means).

We choose 5% as the significance level and proceed with a two-sample t-test. Because `equal_var=False` is passed, this is Welch's t-test, which does not assume equal variances in the two groups.

In [14]:
# Conduct a two-sample t-test to compare means
significance_level = 0.05
stats.ttest_ind(a=data[verified_mask]['video_view_count'], b=data[not_verified_mask]['video_view_count'], equal_var=False)

TtestResult(statistic=-25.499441780633777, pvalue=2.6088823687177823e-120, df=1571.163074387424)
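
With `equal_var=False`, `ttest_ind` performs Welch's t-test, which is why the reported `df` is not an integer. The statistic is negative because the first sample (`a`, the verified accounts) has the lower mean. As a rough sketch of what is being computed, the example below rebuilds the statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand on synthetic data (the samples `a` and `b` are made up) and compares them with SciPy's result.

```python
import numpy as np
from scipy import stats

# Synthetic samples with unequal sizes and variances (made-up data)
rng = np.random.default_rng(seed=1)
a = rng.normal(loc=50.0, scale=5.0, size=40)
b = rng.normal(loc=55.0, scale=15.0, size=120)

# Variance of each sample mean: s^2 / n, with the sample variance (ddof=1)
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

# Welch's t statistic: difference in means over its standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation of the degrees of freedom
df_manual = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

result = stats.ttest_ind(a=a, b=b, equal_var=False)
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4g}, df = {df_manual:.2f}")
print(f"scipy:  t = {result.statistic:.4f}, p = {result.pvalue:.4g}")
```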

Since the p-value (about 2.61e-120) is far below the 5% significance level, we reject the null hypothesis. Therefore, there is a statistically significant difference in mean video view count between verified and unverified accounts.

### **Task 4. Conclusions**

Unverified accounts make up most of the platform's content in this dataset (18,142 of the 19,382 videos in the raw data), and most of the views come from their videos: in the sample, their mean view count is about 265,664, compared with about 91,439 for verified accounts. TikTok should consider how to handle this type of account, because even though unverified accounts are more likely to post claim videos, their content appears to drive stronger engagement. TikTok could take action that favors verified accounts or encourage all accounts to verify.