
<img src="images/Plan.png" width="100" height="100" align=left>


## **PACE: Plan**



**How can you best prepare to understand and organize the provided information?**
- Review the project scenario and dataset structure
- Clarify the meaning of “claim” and “opinion”
- Use `head()`, `info()`, and `describe()` to explore the data

**What codebooks or resources will help you?**
- Coursera example notebooks
- pandas and NumPy documentation
- Quick guides for data cleaning

**What else should you do before coding?**
- Review column names and data types
- Check for missing or unusual values
- Plan basic data cleaning steps (`fillna()`, `dropna()`, `astype()`)
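
The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual cleaning plan: the `toy` frame and its values are made up, only the column names mirror the dataset, and filling with the median is just one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "claim_status": ["claim", "opinion", None, "claim"],
    "video_duration_sec": ["59", "32", "31", "25"],  # stored as text on purpose
    "video_view_count": [343296.0, np.nan, 902185.0, 437506.0],
})

# Count missing values per column before deciding how to handle them.
missing_per_column = toy.isna().sum()

# Drop rows that have no claim_status label.
cleaned = toy.dropna(subset=["claim_status"]).copy()

# Fill the remaining numeric gap with the column median (an assumption;
# dropping the row or using another statistic may suit the project better).
cleaned["video_view_count"] = cleaned["video_view_count"].fillna(
    cleaned["video_view_count"].median()
)

# Convert the duration column from text to integers.
cleaned["video_duration_sec"] = cleaned["video_duration_sec"].astype(int)

print(missing_per_column)
print(cleaned.dtypes)
```

Checking `isna().sum()` first helps decide between dropping and filling, since the right choice depends on how many rows are affected.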


<img src="images/Analyze.png" width="100" height="100" align=left>

## **PACE: Analyze**



**Is the available information enough to meet the project goal?**
- The data seems well-structured and suitable for initial analysis
- Some columns may need closer review for missing or unclear values

**How would you summarize the data and check min/max ranges?**
- Use `df.describe()` for summary stats
- Use `value_counts()` or `groupby()` for categorical columns
- Use `df.min()` and `df.max()` to check value ranges and outliers
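
As a rough sketch of the summary calls above, the snippet below uses the first five rows from the `data.head(10)` output in this notebook, restricted to four columns. The variable names (`sample`, `value_ranges`, `ban_counts`, `mean_views_by_ban`) are illustrative only.

```python
import pandas as pd

# First five rows of the notebook's data.head(10) output, selected columns.
sample = pd.DataFrame({
    "claim_status": ["claim"] * 5,
    "author_ban_status": ["under review", "active", "active", "active", "active"],
    "video_duration_sec": [59, 32, 31, 25, 19],
    "video_view_count": [343296.0, 140877.0, 902185.0, 437506.0, 56167.0],
})

# Min and max of every numeric column in one table.
numeric_cols = sample.select_dtypes(include="number")
value_ranges = numeric_cols.agg(["min", "max"])

# Frequency of each category in a categorical column.
ban_counts = sample["author_ban_status"].value_counts()

# Mean view count within each category.
mean_views_by_ban = sample.groupby("author_ban_status")["video_view_count"].mean()

print(value_ranges)
print(ban_counts)
print(mean_views_by_ban)
```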

**Do any averages look unusual? Can you describe interval data?**
- Compare averages to min/max values to spot outliers
- Use `boxplot` or z-scores to detect anomalies
- Review interval data (e.g., video length) for logical consistency
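
The two anomaly checks mentioned above might look like the sketch below. The view counts are made up for illustration, and the cutoffs (an absolute z-score above 3, and 1.5 times the interquartile range, which is the usual boxplot whisker rule) are common conventions rather than requirements of this project.

```python
import pandas as pd

# Made-up view counts: 19 values near 5,000 and one far larger value.
views = pd.Series(list(range(4900, 5090, 10)) + [900000], name="video_view_count")

# z-score check: distance from the mean in units of standard deviation.
z_scores = (views - views.mean()) / views.std()
z_outliers = views[z_scores.abs() > 3]

# IQR check: the same rule a boxplot uses to draw its whiskers.
q1 = views.quantile(0.25)
q3 = views.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
iqr_outliers = views[(views < lower_fence) | (views > upper_fence)]

print(z_outliers)
print(iqr_outliers)
```

The z-score check depends on the mean and standard deviation, which the outlier itself inflates, so on small or heavily skewed samples the IQR rule is often the more reliable of the two.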


In [2]:
import pandas as pd
import numpy as np

In [3]:
data = pd.read_csv("tiktok_dataset.csv")

In [4]:
data.head(10)

| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |
| 5 | 6 | claim | 8972200955 | 35 | someone shared with me that gross domestic pro... | not verified | under review | 336647.0 | 175546.0 | 62303.0 | 4293.0 | 1857.0 |
| 6 | 7 | claim | 4958886992 | 16 | someone shared with me that elvis presley has ... | not verified | active | 750345.0 | 486192.0 | 193911.0 | 8616.0 | 5446.0 |
| 7 | 8 | claim | 2270982263 | 41 | someone shared with me that the best selling s... | not verified | active | 547532.0 | 1072.0 | 50.0 | 22.0 | 11.0 |
| 8 | 9 | claim | 5235769692 | 50 | someone shared with me that about half of the ... | not verified | active | 24819.0 | 10160.0 | 1050.0 | 53.0 | 27.0 |
| 9 | 10 | claim | 4660861094 | 45 | someone shared with me that it would take a 50... | verified | active | 931587.0 | 171051.0 | 67739.0 | 4104.0 | 2540.0 |


**Question 1:** When reviewing the first few rows of the dataframe, what do you observe about the data? What does each row represent?

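**Answer:** The output shows the column names and the first 10 rows. Each row represents one TikTok video, with its claim status, video ID, duration, transcription text (truncated in the display), the author's verification and ban status, and engagement counts (views, likes, shares, downloads, comments). In all 10 rows, the `claim_status` value is "claim".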

In [26]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB


**Question 2:** When reviewing the `data.info()` output, what do you notice about the different variables? Are there any null values? Are all the variables numeric? Does anything else stand out?


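**Answer:** Yes, there are null values. `claim_status`, `video_transcription_text`, and the five count columns each have 19,084 non-null entries out of 19,382, so each is missing 298 values. Not all variables are numeric: `claim_status`, `video_transcription_text`, `verified_status`, and `author_ban_status` have dtype `object`, while the rest are `int64` or `float64`.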
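
Because seven columns share the same non-null count, a natural follow-up is to check whether the missing values fall in the same rows. The sketch below shows one way to test that on a made-up `toy` frame; the values and the `all_or_nothing` name are illustrative assumptions, and the result for the real dataset is not shown here.

```python
import numpy as np
import pandas as pd

# Made-up frame in which two rows are missing every non-ID field.
toy = pd.DataFrame({
    "video_id": [101, 102, 103, 104],
    "claim_status": ["claim", None, "opinion", None],
    "video_view_count": [343296.0, np.nan, 4953.0, np.nan],
    "video_like_count": [19425.0, np.nan, 1200.0, np.nan],
})

# Rows that contain at least one missing value.
rows_with_nulls = toy[toy.isna().any(axis=1)]

# Number of missing fields in each row.
nulls_per_row = toy.isna().sum(axis=1)

# True when every row is either complete or missing the same maximal
# number of fields, i.e. the gaps line up in the same rows.
all_or_nothing = nulls_per_row.isin([0, nulls_per_row.max()]).all()

print(rows_with_nulls)
print(all_or_nothing)
```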

In [25]:
data.describe()

| | # | video_id | video_duration_sec | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|
| count | 19382.0 | 19382.0 | 19382.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 |
| mean | 9691.5 | 5627454000.0 | 32.421732 | 254708.558688 | 84304.63603 | 16735.248323 | 1049.429627 | 349.312146 |
| std | 5595.245794 | 2536440000.0 | 16.229967 | 322893.280814 | 133420.546814 | 32036.17435 | 2004.299894 | 799.638865 |
| min | 1.0 | 1234959000.0 | 5.0 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4846.25 | 3430417000.0 | 18.0 | 4942.5 | 810.75 | 115.0 | 7.0 | 1.0 |
| 50% | 9691.5 | 5618664000.0 | 32.0 | 9954.5 | 3403.5 | 717.0 | 46.0 | 9.0 |
| 75% | 14536.75 | 7843960000.0 | 47.0 | 504327.0 | 125020.0 | 18222.0 | 1156.25 | 292.0 |
| max | 19382.0 | 9999873000.0 | 60.0 | 999817.0 | 657830.0 | 256130.0 | 14994.0 | 9599.0 |


**Question 3:** When reviewing the `data.describe()` output, what do you notice about the distributions of each variable? Are there any questionable values? Does it seem that there are outlier values?

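**Answer:** `video_duration_sec` ranges from 5 to 60 seconds, and its mean (32.4) is close to its median (32), so nothing looks questionable there. The engagement counts are strongly right-skewed: each mean is far above its median (for example, 254,708.6 vs. 9,954.5 for `video_view_count`), and each standard deviation is larger than its mean. The maximum values, especially for likes, shares, downloads, and comments, sit far above the 75th percentile, which suggests outliers at the high end. The statistics for `#` and `video_id` are not meaningful because those columns are identifiers.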

In [39]:
# What are the different values for claim status and how many of each are in the data?
# value_counts() lists each distinct value with its row count (missing values are excluded by default)
data["claim_status"].value_counts()

In [44]:
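# Boolean masks selecting the rows for each claim status
claim_mask = data["claim_status"] == "claim"
opinion_mask = data["claim_status"] == "opinion"

# Mean and median view count for claims
print(data.loc[claim_mask, "video_view_count"].mean())
print(data.loc[claim_mask, "video_view_count"].median())

# Mean and median view count for opinions
print(data.loc[opinion_mask, "video_view_count"].mean())
print(data.loc[opinion_mask, "video_view_count"].median())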

501029.4527477102
501555.0
4956.43224989447
4953.0
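
The same comparison can be written as a single `groupby` call that returns a labeled table, which makes it easier to see which number belongs to which group. This is a sketch on a small frame: the three claim values come from the `data.head(10)` output above, while the three opinion values are made up for illustration.

```python
import pandas as pd

# Claim rows are taken from the head() output; opinion rows are made up.
toy = pd.DataFrame({
    "claim_status": ["claim", "claim", "claim", "opinion", "opinion", "opinion"],
    "video_view_count": [343296.0, 140877.0, 902185.0, 4800.0, 5100.0, 4950.0],
})

# One row per claim status, one column per statistic.
view_stats = toy.groupby("claim_status")["video_view_count"].agg(["mean", "median"])

print(view_stats)
```

On the full dataset, the equivalent call would be `data.groupby("claim_status")["video_view_count"].agg(["mean", "median"])`.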
