## Dataset Context (Social Media Engagement Data)

This dataset contains information about **social media posts** and how users interact with them.
Typical columns include:

* Platform (Instagram, Twitter, TikTok, etc.)
* Content type (Image, Video, Text)
* Likes
* Shares
* Comments
* Engagement metrics
* Post timing and metadata

A data scientist uses Pandas here to:

* Understand user behavior
* Identify high-engagement content
* Clean missing or inconsistent data
* Prepare the data for analysis or modeling

In [18]:
import pandas as pd
import numpy as np

data = {
    'Platform': ['Instagram', 'TikTok', 'Twitter', 'Instagram', 'TikTok', 'Twitter', 'Instagram', 'TikTok', 'Twitter', 'Instagram'],
    'Content_Type': ['Image', 'Video', 'Text', 'Video', 'Video', 'Text', 'Image', 'Video', 'Text', 'Image'],
    'Likes': [1200, 3500, 45, 2300, 5000, 12, 1500, 4200, 60, 1100],
    'Shares': [50, 200, 5, 120, 300, 2, 60, 250, 8, 45],
    'Comments': [30, 100, 10, 80, 150, 5, 40, 120, 12, 25]
}

df_dummy = pd.DataFrame(data)
df_dummy.to_csv('social_media_viral_content.csv', index=False)
print("'social_media_viral_content.csv' has been created successfully!")

'social_media_viral_content.csv' has been created successfully!


## Loading the Dataset

Key Pandas idea: **Once data is in a DataFrame, EDA begins immediately.**

In [19]:
import pandas as pd

df = pd.read_csv("social_media_viral_content.csv")
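
As a minimal sketch of loading with a little more control, the example below reads a small in-memory CSV instead of a file on disk. The `csv_text` sample and the `sample` variable are illustrative assumptions; `usecols` and `dtype` are standard `read_csv` parameters.

```python
import io

import pandas as pd

# A small in-memory CSV stands in for a file on disk (illustrative data).
csv_text = """Platform,Content_Type,Likes,Shares,Comments
Instagram,Image,1200,50,30
TikTok,Video,3500,200,100
Twitter,Text,45,5,10
"""

# usecols limits which columns are parsed; dtype pins the type of a column.
sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Platform", "Likes"],
    dtype={"Likes": "int64"},
)
print(sample)
```

Restricting columns at load time can help when the real file is much wider than the analysis needs.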

## First Look at the Data (Structure & Shape)

### 1. `head()` and `tail()`

Used to quickly inspect the first and last rows of the DataFrame. Both default to 5 rows when called without an argument.

Purpose:
* Confirm the data loaded correctly
* See column names and example values

In [24]:
print("--- Head ---")
display(df.head(9))

print("\n--- Tail ---")
display(df.tail(6))

--- Head ---


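|   | Platform | Content_Type | Likes | Shares | Comments |
|---|---|---|---|---|---|
| 0 | Instagram | Image | 1200 | 50 | 30 |
| 1 | TikTok | Video | 3500 | 200 | 100 |
| 2 | Twitter | Text | 45 | 5 | 10 |
| 3 | Instagram | Video | 2300 | 120 | 80 |
| 4 | TikTok | Video | 5000 | 300 | 150 |
| 5 | Twitter | Text | 12 | 2 | 5 |
| 6 | Instagram | Image | 1500 | 60 | 40 |
| 7 | TikTok | Video | 4200 | 250 | 120 |
| 8 | Twitter | Text | 60 | 8 | 12 |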



--- Tail ---


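|   | Platform | Content_Type | Likes | Shares | Comments |
|---|---|---|---|---|---|
| 4 | TikTok | Video | 5000 | 300 | 150 |
| 5 | Twitter | Text | 12 | 2 | 5 |
| 6 | Instagram | Image | 1500 | 60 | 40 |
| 7 | TikTok | Video | 4200 | 250 | 120 |
| 8 | Twitter | Text | 60 | 8 | 12 |
| 9 | Instagram | Image | 1100 | 45 | 25 |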


### 2. `shape`

Shows the size of the dataset.

Returns:
```
(number_of_rows, number_of_columns)
```

This helps estimate:
* Dataset size
* Computational cost (memory and run time)
* Data sufficiency

In [4]:
df.shape

(10, 5)
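
Because `shape` is a plain tuple, it can be unpacked directly. The sketch below uses a small illustrative frame (`demo` is a hypothetical name) and shows two related attributes, `len()` and `size`.

```python
import pandas as pd

# Illustrative three-row frame; the values are taken from the sample data.
demo = pd.DataFrame({"Likes": [1200, 3500, 45], "Shares": [50, 200, 5]})

# shape is a (rows, columns) tuple, so it unpacks into two names.
n_rows, n_cols = demo.shape
print(f"{n_rows} rows x {n_cols} columns")

print(len(demo))   # number of rows only
print(demo.size)   # total number of cells (rows * columns)
```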

### 3. `columns`

Lists all column names.

Important for:
* Understanding features
* Selecting or renaming columns

In [5]:
df.columns

Index(['Platform', 'Content_Type', 'Likes', 'Shares', 'Comments'], dtype='object')
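
A short sketch of the renaming step mentioned above, assuming a hypothetical frame whose headers arrived with stray spaces. The column names and the `Like_Count` target name are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with an untidy header ("Content Type " has spaces).
demo = pd.DataFrame({"Content Type ": ["Image"], "Likes": [1200]})

# Tidy all headers at once: trim whitespace, then replace inner spaces.
demo.columns = demo.columns.str.strip().str.replace(" ", "_")

# Rename one specific column; rename() returns a new DataFrame.
renamed = demo.rename(columns={"Likes": "Like_Count"})
print(list(renamed.columns))
```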

## Understanding Data Types

### 4. `info()`

One of the **most important EDA methods**.

Shows:
* Column names
* Data types
* Non-null counts
* Memory usage

Why this matters:
* Identifies missing values
* Detects wrong data types (e.g., numbers stored as strings)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Platform      10 non-null     object
 1   Content_Type  10 non-null     object
 2   Likes         10 non-null     int64 
 3   Shares        10 non-null     int64 
 4   Comments      10 non-null     int64 
dtypes: int64(3), object(2)
memory usage: 532.0+ bytes
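
The sample data has clean types, so the "numbers stored as strings" case does not show up here. The sketch below builds a hypothetical column (`raw`, with made-up values) where it does, and converts it with `pd.to_numeric`.

```python
import pandas as pd

# Hypothetical messy column: digits stored as text, one with a thousands
# separator and one placeholder for a missing value.
raw = pd.DataFrame({"Likes": ["1200", "3,500", "n/a"]})
print(raw.dtypes)  # Likes is a text column, not a numeric one

# Remove the separator, then convert; errors="coerce" turns anything
# that still cannot be parsed into NaN instead of raising.
cleaned = raw["Likes"].str.replace(",", "", regex=False)
raw["Likes_num"] = pd.to_numeric(cleaned, errors="coerce")
print(raw)
```

After the conversion, `info()` would report `Likes_num` as numeric with one fewer non-null value, which makes the bad entry visible.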


## Descriptive Statistics

### 5. `describe()`

Provides a statistical summary of the numeric columns.

Includes:
* Count
* Mean
* Standard deviation
* Minimum and maximum
* Quartiles

In this dataset:
* Helps understand engagement ranges
* Detects unusually high or low engagement

In [7]:
df.describe()

|   | Likes | Shares | Comments |
|---|---|---|---|
| count | 10.0 | 10.0 | 10.0 |
| mean | 1891.7 | 104.0 | 57.2 |
| std | 1804.330596 | 109.01478 | 51.613521 |
| min | 12.0 | 2.0 | 5.0 |
| 25% | 320.0 | 17.25 | 15.25 |
| 50% | 1350.0 | 55.0 | 35.0 |
| 75% | 3200.0 | 180.0 | 95.0 |
| max | 5000.0 | 300.0 | 150.0 |
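
One common way to turn the quartiles into a check for unusually high or low values is the 1.5 × IQR rule. This is a sketch of that rule applied to the `Likes` values from the sample data, not the only possible definition of an outlier; the variable names are illustrative.

```python
import pandas as pd

# Likes values from the sample dataset.
likes = pd.Series(
    [1200, 3500, 45, 2300, 5000, 12, 1500, 4200, 60, 1100], name="Likes"
)

# Quartiles and interquartile range, the same 25% / 75% that describe() reports.
q1 = likes.quantile(0.25)
q3 = likes.quantile(0.75)
iqr = q3 - q1

# Anything outside [q1 - 1.5*iqr, q3 + 1.5*iqr] is flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = likes[(likes < lower) | (likes > upper)]

print(f"bounds: [{lower}, {upper}]")
print("flagged values:", outliers.tolist())
```

With only ten widely spread posts the bounds are broad, so this rule flags nothing here; on a larger dataset it tends to be more informative.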


### 6. `describe(include="object")`

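Summarizes categorical (object-typed) columns: the non-null count, the number of unique values, the most frequent value (`top`), and how often it occurs (`freq`).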

Useful for:
* Content type frequency
* Platform distribution
* Detecting inconsistent labels

In [8]:
df.describe(include="object")

|   | Platform | Content_Type |
|---|---|---|
| count | 10 | 10 |
| unique | 3 | 3 |
| top | Instagram | Video |
| freq | 4 | 4 |
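
The sample labels are already consistent, so the sketch below uses a hypothetical series (`labels`, with made-up spellings) to show how an inflated `unique` count reveals inconsistent labels and one way to normalize them.

```python
import pandas as pd

# Hypothetical labels with mixed case and stray whitespace.
labels = pd.Series(
    ["Video", "video", " Video ", "Image", "IMAGE"], name="Content_Type"
)
print(labels.nunique())  # more categories than really exist

# Trim whitespace and standardize capitalization.
normalized = labels.str.strip().str.title()
print(normalized.value_counts())
```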


## Handling Missing Data

### 7. `isna()` and `sum()`

Purpose:
* Count missing values per column
* Decide whether to drop or fill data

In [19]:
df.isna().sum()

Platform        0
Content_Type    0
Likes           0
Shares          0
Comments        0
dtype: int64

### 8. `dropna()` and `fillna()`

Common in social media data where:
* Engagement metrics may be missing
* Some posts lack interaction data

In [10]:
# dropna() returns a new DataFrame; df itself is left unchanged
print("Shape after dropna:", df.dropna().shape)

# fillna(0) replaces every missing value with 0, in text columns as well as numeric ones
df_filled = df.fillna(0)

Shape after dropna: (10, 5)
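
Since the sample data has no missing values, the calls above leave it unchanged. The following sketch uses a hypothetical frame (`demo`) with gaps to show the difference between dropping rows and filling per column; the choice of median and 0 as fill values is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing engagement values.
demo = pd.DataFrame({
    "Platform": ["Instagram", "TikTok", "Twitter"],
    "Likes": [1200, np.nan, 45],
    "Shares": [50, 200, np.nan],
})

# Drop every row that has at least one missing value.
dropped = demo.dropna()

# Drop only rows where a specific column is missing.
dropped_likes = demo.dropna(subset=["Likes"])

# Fill each column with its own strategy via a dict.
filled = demo.fillna({"Likes": demo["Likes"].median(), "Shares": 0})

print(len(dropped), len(dropped_likes))
print(filled)
```

Passing a dict to `fillna` avoids writing a numeric placeholder into text columns.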


## Column Selection and Filtering

### 9. Selecting Columns

Used to:
* Focus on engagement metrics
* Create subsets for analysis

In [11]:
# Single column (returns Series)
display(df["Likes"].head())

# Multiple columns (returns DataFrame)
display(df[["Likes", "Shares", "Comments"]].head())

0    1200
1    3500
2      45
3    2300
4    5000
Name: Likes, dtype: int64

|   | Likes | Shares | Comments |
|---|---|---|---|
| 0 | 1200 | 50 | 30 |
| 1 | 3500 | 200 | 100 |
| 2 | 45 | 5 | 10 |
| 3 | 2300 | 120 | 80 |
| 4 | 5000 | 300 | 150 |
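
Two other selection styles can be handy when the column list is long: selecting by dtype and selecting by label with `.loc`. A minimal sketch on an illustrative frame (`demo` is a hypothetical name):

```python
import pandas as pd

demo = pd.DataFrame({
    "Platform": ["Instagram", "TikTok", "Twitter"],
    "Likes": [1200, 3500, 45],
    "Shares": [50, 200, 5],
})

# Pick all numeric columns without naming them one by one.
numeric = demo.select_dtypes(include="number")

# Label-based selection: all rows, two named columns.
subset = demo.loc[:, ["Platform", "Likes"]]

print(list(numeric.columns))
print(subset)
```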


### 10. Filtering Rows

Data science use:
* Identify viral posts
* Filter high-performing content

In [12]:
# Filter for highly liked posts
viral_posts = df[df["Likes"] > 1000]
viral_posts

|   | Platform | Content_Type | Likes | Shares | Comments |
|---|---|---|---|---|---|
| 0 | Instagram | Image | 1200 | 50 | 30 |
| 1 | TikTok | Video | 3500 | 200 | 100 |
| 3 | Instagram | Video | 2300 | 120 | 80 |
| 4 | TikTok | Video | 5000 | 300 | 150 |
| 6 | Instagram | Image | 1500 | 60 | 40 |
| 7 | TikTok | Video | 4200 | 250 | 120 |
| 9 | Instagram | Image | 1100 | 45 | 25 |
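
Filters often combine several conditions. The sketch below, on an illustrative frame (`demo`, with thresholds chosen arbitrarily), shows boolean masks joined with `&`, membership tests with `isin()`, and the equivalent `query()` form.

```python
import pandas as pd

demo = pd.DataFrame({
    "Platform": ["Instagram", "TikTok", "Twitter", "TikTok"],
    "Content_Type": ["Image", "Video", "Text", "Video"],
    "Likes": [1200, 3500, 45, 5000],
})

# Combine conditions with & (and) or | (or); each one needs parentheses.
mask = (demo["Platform"] == "TikTok") & (demo["Likes"] > 4000)
tiktok_top = demo[mask]

# isin() keeps rows whose value is in a list of allowed values.
visual = demo[demo["Content_Type"].isin(["Image", "Video"])]

# query() expresses the same filter as a string.
same = demo.query("Platform == 'TikTok' and Likes > 4000")

print(tiktok_top)
print(len(visual))
```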


## Sorting and Ranking

### 11. `sort_values()`

Helps answer:
* Which posts performed best?
* Which platforms appear among the top-ranked posts?

In [21]:
# Sort by Likes in descending order
df.sort_values(by="Likes", ascending=False).head()

|   | Platform | Content_Type | Likes | Shares | Comments |
|---|---|---|---|---|---|
| 4 | TikTok | Video | 5000 | 300 | 150 |
| 7 | TikTok | Video | 4200 | 250 | 120 |
| 1 | TikTok | Video | 3500 | 200 | 100 |
| 3 | Instagram | Video | 2300 | 120 | 80 |
| 6 | Instagram | Image | 1500 | 60 | 40 |
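
Beyond a single-column sort, a few related calls cover most ranking questions. This is a sketch on an illustrative frame (`demo` and `likes_rank` are hypothetical names): sorting by several keys, taking the top rows with `nlargest()`, and assigning explicit ranks with `rank()`.

```python
import pandas as pd

demo = pd.DataFrame({
    "Platform": ["Instagram", "TikTok", "Twitter", "Instagram", "TikTok"],
    "Likes": [1200, 3500, 45, 2300, 5000],
})

# Sort by platform (A-Z), then by Likes (highest first) within each platform.
ranked = demo.sort_values(by=["Platform", "Likes"], ascending=[True, False])

# Shortcut for "top N rows by a column".
top2 = demo.nlargest(2, "Likes")

# Explicit rank per post: 1 = most liked.
likes_rank = demo["Likes"].rank(ascending=False, method="min").astype(int)

print(ranked)
print(top2)
print(likes_rank.tolist())
```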


## Grouping and Aggregation (Very Important in EDA)

### 12. `groupby()`

Why it matters:
* Compare platforms
* Understand average engagement patterns
* Core method for business insights

In [27]:
# Average Likes per Platform
df.groupby("Platform")["Likes"].mean()

Platform
Instagram    1525.000000
TikTok       4233.333333
Twitter        39.000000
Name: Likes, dtype: float64
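
To compare platforms on several measures at once, `groupby()` can be paired with named aggregation. A minimal sketch on an illustrative frame; the output column names (`posts`, `mean_likes`, `total_shares`) are made up for this example.

```python
import pandas as pd

demo = pd.DataFrame({
    "Platform": ["Instagram", "TikTok", "Twitter", "Instagram", "TikTok"],
    "Likes": [1200, 3500, 45, 2300, 5000],
    "Shares": [50, 200, 5, 120, 300],
})

# Named aggregation: new_column=(source_column, function).
summary = demo.groupby("Platform").agg(
    posts=("Likes", "count"),
    mean_likes=("Likes", "mean"),
    total_shares=("Shares", "sum"),
)
print(summary)
```

Reporting the post count next to the mean helps show how many rows each average rests on.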

## Creating New Columns (Feature Engineering)

### 13. Creating Derived Metrics

Used to:
* Create meaningful features
* Combine multiple engagement signals

In [15]:
df["Total_Engagement"] = df["Likes"] + df["Shares"] + df["Comments"]
df.head()

|   | Platform | Content_Type | Likes | Shares | Comments | Total_Engagement |
|---|---|---|---|---|---|---|
| 0 | Instagram | Image | 1200 | 50 | 30 | 1280 |
| 1 | TikTok | Video | 3500 | 200 | 100 | 3800 |
| 2 | Twitter | Text | 45 | 5 | 10 | 60 |
| 3 | Instagram | Video | 2300 | 120 | 80 | 2500 |
| 4 | TikTok | Video | 5000 | 300 | 150 | 5450 |
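
Derived columns can also be ratios or flags. The sketch below adds two hypothetical features, `Share_Rate` and `High_Engagement`, to an illustrative frame; neither metric comes from the dataset, and the median threshold is an arbitrary choice.

```python
import pandas as pd

demo = pd.DataFrame({
    "Likes": [1200, 3500, 45],
    "Shares": [50, 200, 5],
    "Comments": [30, 100, 10],
})

# Sum of the three engagement signals, as in the cell above.
demo["Total_Engagement"] = demo["Likes"] + demo["Shares"] + demo["Comments"]

# Hypothetical ratio: what fraction of all engagement is shares?
demo["Share_Rate"] = demo["Shares"] / demo["Total_Engagement"]

# Hypothetical flag: is the post above the median total engagement?
demo["High_Engagement"] = (
    demo["Total_Engagement"] > demo["Total_Engagement"].median()
)

print(demo)
```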


## Checking Duplicates

### 14. `duplicated()` and `drop_duplicates()`

Important because:
* Social media datasets often contain repeated records
* Duplicates distort analysis

In [28]:
print("Duplicates:", df.duplicated().sum())

# To drop them (returns a new DataFrame or use inplace=True)
# df = df.drop_duplicates()

Duplicates: 0
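
The sample data has no duplicates, so here is a sketch on a hypothetical frame (`demo`) that does. It also shows the `subset` parameter, which treats rows as duplicates based on selected columns only.

```python
import pandas as pd

# Hypothetical frame: row 2 is an exact copy of row 0.
demo = pd.DataFrame({
    "Platform": ["Instagram", "TikTok", "Instagram", "TikTok"],
    "Likes": [1200, 3500, 1200, 4200],
})

# duplicated() marks later copies of rows already seen.
n_dupes = int(demo.duplicated().sum())

# Remove exact duplicates and rebuild a clean 0..n-1 index.
deduped = demo.drop_duplicates().reset_index(drop=True)

# Keep only the first row per platform.
by_key = demo.drop_duplicates(subset=["Platform"], keep="first")

print(n_dupes, len(deduped), len(by_key))
```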


## Value Counts (Categorical Analysis)

### 15. `value_counts()`

Helps answer:
* Most common content type
* Platform usage distribution

In [31]:
df["Content_Type"].value_counts()

Content_Type
Video    4
Image    3
Text     3
Name: count, dtype: int64
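
For a distribution rather than raw counts, `value_counts(normalize=True)` returns proportions. A short sketch using the `Content_Type` values from the sample data (`types` and `proportions` are illustrative names):

```python
import pandas as pd

# Content_Type values from the sample dataset.
types = pd.Series(
    ["Image", "Video", "Text", "Video", "Video",
     "Text", "Image", "Video", "Text", "Image"],
    name="Content_Type",
)

# normalize=True divides each count by the total number of rows.
proportions = types.value_counts(normalize=True)
print(proportions)
```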

## Why Pandas is Critical in EDA

Pandas allows a data scientist to:
* Understand data structure
* Clean messy datasets
* Extract insights quickly
* Prepare data for visualization and modeling

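A commonly quoted rule of thumb for real-world data science workflows:

> **Most of the work (often put at around 80%) goes into exploring and preparing data, with tools like Pandas, before modeling begins.**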