# Clean & Analyze Social Media

## Introduction

Social media has become a ubiquitous part of modern life, with platforms such as Instagram, Twitter, and Facebook serving as essential communication channels. Social media data sets are vast and complex, which makes analysis challenging for businesses and researchers alike. In this project, we explore a simulated social media data set (for example, tweets) to understand trends in likes across different categories.

## Prerequisites

To follow along with this project, you should have a basic understanding of Python programming and data analysis concepts. In addition, you may want to use the following packages in your Python environment:

- pandas
- NumPy
- Matplotlib
- seaborn
- SciPy

These packages should already be installed in Coursera's Jupyter Notebook environment. If you are working off platform, or need a package that the environment does not include, you can install it by running `!pip install packagename` in a notebook cell, for example:

- `!pip install pandas`
- `!pip install matplotlib`
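
Before installing anything, it can help to check which packages are already importable. The following is a minimal sketch that uses only the standard library; the helper name `missing_packages` is hypothetical and not part of the project.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


# Import names used by this notebook (import names, which may differ from pip names)
required = ["pandas", "numpy", "matplotlib", "seaborn", "scipy"]
print(missing_packages(required))
```

Any name it reports can then be installed with `!pip install` as described above.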

## Project Scope

The objective of this project is to analyze tweets (or other social media data) and gain insights into user engagement. We will explore the data set using visualization techniques to understand the distribution of likes across different categories. Finally, we will analyze the data to draw conclusions about the most popular categories and the overall engagement on the platform.

## Step 1: Importing Required Libraries

The first step is to import the libraries used in the project. The main ones are pandas, NumPy, Matplotlib, seaborn, and Python's built-in `random` module; the cell below also imports `csv`, `json`, `statistics`, and `scipy.stats`.

pandas is used for data manipulation and analysis, and NumPy for numerical computation. Matplotlib is a general-purpose plotting library, and seaborn builds statistical visualizations on top of it. The `random` module generates pseudo-random numbers.

In [1]:
import csv # Standard-library module for reading and writing tabular data in CSV format
import pandas as pd # Library for analyzing, cleaning, exploring, and manipulating data sets
import json # Standard-library module for encoding and decoding JSON data
import numpy as np # Library for large, multi-dimensional arrays and matrices, plus mathematical functions that operate on them
import seaborn as sns # Library for statistical visualizations
import statistics as stat # Standard-library module for basic statistical functions
import matplotlib.pyplot as plt # Library for creating graphs
import random # Standard-library module for generating pseudo-random data

from scipy import stats # SciPy's statistics submodule (distributions, statistical tests, linear regression)

random.seed(4) # Seed Python's generator so the same random data is produced on every run
np.random.seed(4) # Seed NumPy's global generator as well
total = 1000 # Number of simulated posts
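
Seeding is what makes the simulated data repeatable: reseeding with the same value restarts the generator at the same point, so the same sequence comes out again. Here is a minimal sketch with the standard-library `random` module. The helper `draw` is a hypothetical name used only for illustration; `np.random.seed` plays the same role for NumPy's global generator.

```python
import random


def draw(seed, n=5):
    """Reseed the standard-library generator and return n pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]


first = draw(4)
second = draw(4)

# Both calls start from the same seed, so the two lists are identical
print(first == second)  # True
```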

In [2]:
# Define various lists for data generation

categories = ["Food", "Travel", "Fashion", "Fitness", "Music", "Culture", "Family", "Health"]
content = ["Image", "Video", "Story"]
objectives = ["Increase Engagement", "Brand Awareness", "Conversions", "Lead Generation"]
days = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]
time = ["Morning", "Afternoon", "Evening", "Night"]
sentiment = ["Positive", "Neutral", "Negative"]
campaigns = ["Campaign 1", "Campaign 2", "Campaign 3", "Campaign 4", "Campaign 5"]
platforms = ["Instagram", "Twitter-X", "Facebook", "LinkedIn"]
conversions = ["Purchases", # Completing a transaction or buying a product
               "Sign-Ups", # Registering for an account, newsletter, or service
               "Downloads", # Downloading a file, app, or resource
               "Form Submissions", # Filling out and submitting a contact form or survey
               "Lead Generation"] # Providing contact information or expressing interest in a service or product


# Define the change in followers
initial_followers = 100000 # Starting follower count
current_followers = initial_followers
# Create Lists to hold follower changes and current followers
change_in_followers = []
follower_counts = []
# Generate change in followers with random increases and decreases
for i in range(total):
    change = np.random.randint(-1000, 5000)
    change_in_followers.append(change)
    # Update the current follower count
    current_followers += change
    follower_counts.append(max(current_followers, 0)) # Ensure non-negative follower counts
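
The loop above is a running total: each follower count is the starting count plus the sum of all changes so far, floored at zero when it is recorded. As a sketch of the same idea, the total can also be computed with `np.cumsum`. The function names are hypothetical, and the sample changes are the first five `Change in Followers` values shown in the notebook output further down. Drawing all changes in one `np.random.randint(..., size=total)` call may not reproduce the exact values the loop draws, so this sketch takes the changes as given.

```python
import numpy as np


def follower_counts_loop(initial, changes):
    """Running follower total, mirroring the loop in the cell above."""
    current = initial
    counts = []
    for change in changes:
        current += change
        counts.append(max(current, 0))  # floor the recorded value at zero
    return counts


def follower_counts_vectorized(initial, changes):
    """Same result using a cumulative sum instead of an explicit loop."""
    running_total = initial + np.cumsum(changes)
    return np.maximum(running_total, 0).tolist()


changes = [146, -826, 1487, -291, 2671]
print(follower_counts_loop(100000, changes))        # [100146, 99320, 100807, 100516, 103187]
print(follower_counts_vectorized(100000, changes))  # [100146, 99320, 100807, 100516, 103187]
```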

In [3]:
# Create Random Data for the Project

data = {
    # Post information
    "ID": list(range(total)),
    "Date": pd.date_range("2020-01-01", periods=total),
    "Time": [random.choice(time) for _ in range(total)],
    "Content": [random.choice(content) for _ in range(total)],
    "Category": [random.choice(categories) for _ in range(total)],
    "Platform": [random.choice(platforms) for _ in range(total)],

    # Post metrics
    "Likes": np.random.randint(500, 100000, size=total),
    "Comments": np.random.randint(100, 1000, size=total),
    "Shares": np.random.randint(100, 2000, size=total),
    "Views": np.random.randint(1000, 50000, size=total),
    "Hashtags": np.random.randint(0, 15, size=total),

    # Audience metrics
    "Followers": follower_counts,
    "Change in Followers": change_in_followers,

    # Business metrics
    "Campaign ID": [random.choice(campaigns) for _ in range(total)],
    "Campaign Budget": np.random.randint(1000, 50000, size=total),
    "Post Objective": [random.choice(objectives) for _ in range(total)],
    "Impressions": np.random.randint(1000, 100000, size=total),
    "Clicks": np.random.randint(0, 5000, size=total),
    "Conversions": np.random.randint(0, 1000, size=total),
}
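
Every value in the dictionary must have the same length before it can become a DataFrame, and `pd.date_range` with `periods=total` supplies one calendar day per post. The sketch below checks both points on a two-column subset; `lengths` and `df_demo` are hypothetical names used only here.

```python
import pandas as pd

total = 1000
dates = pd.date_range("2020-01-01", periods=total)  # default frequency is one day

# A two-column subset of the dictionary built above
data = {"ID": list(range(total)), "Date": dates}

# All columns must share one length for pd.DataFrame to accept them
lengths = {name: len(values) for name, values in data.items()}
print(lengths)

df_demo = pd.DataFrame(data)
print(dates[0].date(), dates[-1].date())  # 2020-01-01 2022-09-26
```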

In [4]:
# Create a DataFrame from the data dictionary
df = pd.DataFrame(data)
df.head()

Unnamed: 0,ID,Date,Time,Content,Category,Platform,Likes,Comments,Shares,Views,Hashtags,Followers,Change in Followers,Campaign ID,Campaign Budget,Post Objective,Impressions,Clicks,Conversions
0,0,2020-01-01,Afternoon,Story,Fashion,Twitter-X,31919,723,103,6910,12,100146,146,Campaign 5,49641,Lead Generation,36399,1244,339
1,1,2020-01-02,Evening,Story,Fashion,Twitter-X,76479,641,311,8434,11,99320,-826,Campaign 4,24391,Lead Generation,86591,1383,676
2,2,2020-01-03,Morning,Video,Health,Twitter-X,7900,423,1082,16321,3,100807,1487,Campaign 5,43779,Increase Engagement,35652,2667,454
3,3,2020-01-04,Night,Video,Fashion,LinkedIn,40711,844,712,28192,6,100516,-291,Campaign 4,21530,Increase Engagement,99606,1460,380
4,4,2020-01-05,Night,Video,Music,Twitter-X,60508,474,1633,17934,2,103187,2671,Campaign 2,30140,Increase Engagement,50280,2046,633


In [5]:
# Clean the data

# Keep only rows where the funnel narrows: Impressions > Clicks > Conversions
df = df[(df["Impressions"] > df["Clicks"]) & (df["Clicks"] > df["Conversions"])]

# Keep only rows with more Impressions than Views; copy so the assignment below modifies this frame, not a view
df = df[df["Impressions"] > df["Views"]].copy()

# Views apply only to Videos and Stories, so set them to 0 for every other content type
df.loc[~df["Content"].isin(["Video", "Story"]), "Views"] = 0

# Scale the engagement metrics (Likes, Comments, Shares, Views) and the Impressions funnel to simulate viral posts
def viral_multiplier(row):
    if row["Change in Followers"] >= 3000:
        engagement_multiplier = np.random.uniform(2.5, 4.0) # Large increase in followers
        impression_multiplier = np.random.uniform(1.5, 2.5) # Moderate multiplier for Impressions
    elif 2000 <= row["Change in Followers"] < 3000:
        engagement_multiplier = np.random.uniform(1.5, 2.5) # Moderate increase in followers
        impression_multiplier = np.random.uniform(1.2, 1.8)
    elif 500 <= row["Change in Followers"] < 2000:
        engagement_multiplier = np.random.uniform(1.1, 1.5) # Small increase in followers
        impression_multiplier = np.random.uniform(1.1, 1.3)
    else:
        engagement_multiplier = np.random.uniform(0.25, 0.50) # Little or no follower growth: reduce engagement
        impression_multiplier = np.random.uniform(0.25, 0.50) # Reduce Impressions, Clicks, and Conversions by the same range
        
    # Apply the multiplier to the engagement metrics
    row["Likes"] = int(row["Likes"] * engagement_multiplier)
    row["Comments"] = int(row["Comments"] * engagement_multiplier)
    row["Shares"] = int(row["Shares"] * engagement_multiplier)
    row["Views"] = int(row["Views"] * engagement_multiplier)
    
    # Apply the multiplier to the Impression metrics
    row["Impressions"] = int(row["Impressions"] * impression_multiplier)
    row["Clicks"] = int(row["Clicks"] * impression_multiplier)
    row["Conversions"] = int(row["Conversions"] * impression_multiplier)

    
    return row

# Apply viral_multiplier function to each row of the DataFrame
df = df.apply(viral_multiplier, axis = 1)
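
The branch logic in `viral_multiplier` amounts to sorting each post into one of four tiers by its follower change, then drawing multipliers from that tier's ranges. The sketch below separates the tier lookup from the random draw so the thresholds are easy to check. The names `TIER_RANGES` and `follower_tier` and the tier labels are hypothetical; the thresholds and ranges are copied from the function above.

```python
# (engagement multiplier range, impression multiplier range) for each tier
TIER_RANGES = {
    "large": ((2.5, 4.0), (1.5, 2.5)),
    "moderate": ((1.5, 2.5), (1.2, 1.8)),
    "small": ((1.1, 1.5), (1.1, 1.3)),
    "decline": ((0.25, 0.50), (0.25, 0.50)),
}


def follower_tier(change):
    """Map a change in followers to the tier whose multiplier ranges apply."""
    if change >= 3000:
        return "large"
    if change >= 2000:
        return "moderate"
    if change >= 500:
        return "small"
    return "decline"


# Follower changes taken from rows shown in the notebook output
for change in (4433, 2671, 1487, 146, -826):
    tier = follower_tier(change)
    print(change, tier, TIER_RANGES[tier])
```

Because the checks run from the highest threshold down, each tier only needs a lower bound.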
    

In [6]:
df.head(10)

Unnamed: 0,ID,Date,Time,Content,Category,Platform,Likes,Comments,Shares,Views,Hashtags,Followers,Change in Followers,Campaign ID,Campaign Budget,Post Objective,Impressions,Clicks,Conversions
0,0,2020-01-01,Afternoon,Story,Fashion,Twitter-X,9468,214,30,2049,12,100146,146,Campaign 5,49641,Lead Generation,15514,530,144
1,1,2020-01-02,Evening,Story,Fashion,Twitter-X,25892,217,105,2855,11,99320,-826,Campaign 4,24391,Lead Generation,36885,589,287
2,2,2020-01-03,Morning,Video,Health,Twitter-X,11413,611,1563,23578,3,100807,1487,Campaign 5,43779,Increase Engagement,43270,3236,551
3,3,2020-01-04,Night,Video,Fashion,LinkedIn,14131,292,247,9786,6,100516,-291,Campaign 4,21530,Increase Engagement,42815,627,163
4,4,2020-01-05,Night,Video,Music,Twitter-X,144902,1135,3910,42947,2,103187,2671,Campaign 2,30140,Increase Engagement,77295,3145,973
5,5,2020-01-06,Afternoon,Image,Food,Instagram,16888,137,388,0,5,102643,-544,Campaign 5,3401,Lead Generation,16383,312,242
6,6,2020-01-07,Morning,Video,Travel,Instagram,203781,1966,2217,32143,6,106132,3489,Campaign 1,34381,Lead Generation,63856,8659,2148
11,11,2020-01-12,Morning,Image,Music,Instagram,229315,3280,4656,0,11,119243,4433,Campaign 3,26114,Increase Engagement,128817,4969,266
12,12,2020-01-13,Afternoon,Video,Fashion,LinkedIn,22310,82,46,15141,9,119189,-54,Campaign 5,32721,Conversions,29101,583,40
13,13,2020-01-14,Evening,Story,Health,Instagram,129639,1544,6157,59661,5,123435,4246,Campaign 5,6351,Increase Engagement,129780,5402,1294
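
The index in the output above jumps from 6 to 11, which is what the cleaning filters would produce if rows 7 to 10 failed one of the comparisons. The sketch below applies the same three rules to a small hand-made table so the effect of each rule is visible. The `posts` data is invented for illustration.

```python
import pandas as pd

posts = pd.DataFrame({
    "Content": ["Image", "Video", "Story", "Video"],
    "Impressions": [5000, 4000, 900, 3000],
    "Clicks": [300, 4500, 100, 200],
    "Conversions": [20, 10, 5, 50],
    "Views": [1200, 800, 700, 3500],
})

# Rule 1: the funnel must narrow (Impressions > Clicks > Conversions)
funnel_ok = (posts["Impressions"] > posts["Clicks"]) & (posts["Clicks"] > posts["Conversions"])
# Rule 2: a post cannot have more Views than Impressions
views_ok = posts["Impressions"] > posts["Views"]

cleaned = posts[funnel_ok & views_ok].copy()

# Rule 3: only Videos and Stories have Views
cleaned.loc[~cleaned["Content"].isin(["Video", "Story"]), "Views"] = 0

print(cleaned)  # rows 1 and 3 are dropped; the Image row keeps its place with Views set to 0
```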


In [7]:


# Add calendar features derived from the post date
df["Day of the Week"] = df["Date"].dt.day_name()
df["Weekday-Weekend"] = np.where(df["Date"].dt.dayofweek >= 5, "Weekend", "Weekday")

df["Engaement Rate"] = (df["Likes"] + df["Comments"] + df["Shares"]) / df["Impressions"]
df["CTR"] = df["Clicks"] / df["Impressions"]
df["ROI"] = (df["Conversions"] * 100) / df["Campaign Budget"]
# Review the Head of the DataFrame
df.head(10)

Unnamed: 0,ID,Date,Time,Content,Category,Platform,Likes,Comments,Shares,Views,...,Campaign Budget,Post Objective,Impressions,Clicks,Conversions,Day of the Week,Weekday-Weekend,Engaement Rate,CTR,ROI
0,0,2020-01-01,Afternoon,Story,Fashion,Twitter-X,9468,214,30,2049,...,49641,Lead Generation,15514,530,144,Wednesday,Weekday,0.626015,0.034163,0.290083
1,1,2020-01-02,Evening,Story,Fashion,Twitter-X,25892,217,105,2855,...,24391,Lead Generation,36885,589,287,Thursday,Weekday,0.710695,0.015969,1.176664
2,2,2020-01-03,Morning,Video,Health,Twitter-X,11413,611,1563,23578,...,43779,Increase Engagement,43270,3236,551,Friday,Weekday,0.314005,0.074786,1.258594
3,3,2020-01-04,Night,Video,Fashion,LinkedIn,14131,292,247,9786,...,21530,Increase Engagement,42815,627,163,Saturday,Weekend,0.342637,0.014644,0.757083
4,4,2020-01-05,Night,Video,Music,Twitter-X,144902,1135,3910,42947,...,30140,Increase Engagement,77295,3145,973,Sunday,Weekend,1.939931,0.040688,3.228268
5,5,2020-01-06,Afternoon,Image,Food,Instagram,16888,137,388,0,...,3401,Lead Generation,16383,312,242,Monday,Weekday,1.06287,0.019044,7.115554
6,6,2020-01-07,Morning,Video,Travel,Instagram,203781,1966,2217,32143,...,34381,Lead Generation,63856,8659,2148,Tuesday,Weekday,3.256765,0.135602,6.247637
11,11,2020-01-12,Morning,Image,Music,Instagram,229315,3280,4656,0,...,26114,Increase Engagement,128817,4969,266,Sunday,Weekend,1.841768,0.038574,1.018611
12,12,2020-01-13,Afternoon,Video,Fashion,LinkedIn,22310,82,46,15141,...,32721,Conversions,29101,583,40,Monday,Weekday,0.771039,0.020034,0.122246
13,13,2020-01-14,Evening,Story,Health,Instagram,129639,1544,6157,59661,...,6351,Increase Engagement,129780,5402,1294,Tuesday,Weekday,1.058252,0.041624,20.374744
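
The three derived columns are simple ratios, and the first row of the table above is enough to check them by hand. The sketch below restates the formulas as plain functions (the function names are hypothetical) and evaluates them with that row's values.

```python
def engagement_rate(likes, comments, shares, impressions):
    """Interactions per impression."""
    return (likes + comments + shares) / impressions


def click_through_rate(clicks, impressions):
    """Clicks per impression."""
    return clicks / impressions


def roi(conversions, budget):
    """Conversions per 100 units of campaign budget, as defined in the cell above."""
    return conversions * 100 / budget


# Values from the first row of the output above (ID 0)
print(round(engagement_rate(9468, 214, 30, 15514), 6))
print(round(click_through_rate(530, 15514), 6))
print(round(roi(144, 49641), 6))
```

Two caveats follow from these definitions. The engagement rate can exceed 1 in this data set, because likes are drawn on a larger scale than impressions and may be scaled by a larger multiplier. The `ROI` column counts conversions relative to budget rather than a monetary return, since the simulated data has no revenue figure.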


In [8]:
# Review the Head of the DataFrame
df.head()

Unnamed: 0,ID,Date,Time,Content,Category,Platform,Likes,Comments,Shares,Views,...,Campaign Budget,Post Objective,Impressions,Clicks,Conversions,Day of the Week,Weekday-Weekend,Engaement Rate,CTR,ROI
0,0,2020-01-01,Afternoon,Story,Fashion,Twitter-X,9468,214,30,2049,...,49641,Lead Generation,15514,530,144,Wednesday,Weekday,0.626015,0.034163,0.290083
1,1,2020-01-02,Evening,Story,Fashion,Twitter-X,25892,217,105,2855,...,24391,Lead Generation,36885,589,287,Thursday,Weekday,0.710695,0.015969,1.176664
2,2,2020-01-03,Morning,Video,Health,Twitter-X,11413,611,1563,23578,...,43779,Increase Engagement,43270,3236,551,Friday,Weekday,0.314005,0.074786,1.258594
3,3,2020-01-04,Night,Video,Fashion,LinkedIn,14131,292,247,9786,...,21530,Increase Engagement,42815,627,163,Saturday,Weekend,0.342637,0.014644,0.757083
4,4,2020-01-05,Night,Video,Music,Twitter-X,144902,1135,3910,42947,...,30140,Increase Engagement,77295,3145,973,Sunday,Weekend,1.939931,0.040688,3.228268
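
The `Day of the Week` and `Weekday-Weekend` columns shown above come from the `.dt` accessor on the `Date` column. As a small self-contained sketch, the same labels can be derived for four of the dates in the table. The weekend flag here uses `dt.dayofweek`, where Monday is 0 and Sunday is 6; this is one possible approach and is equivalent to comparing day names.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2020-01-03", "2020-01-04", "2020-01-05", "2020-01-06"]))

day_names = dates.dt.day_name()
is_weekend = dates.dt.dayofweek >= 5  # 5 is Saturday, 6 is Sunday
labels = is_weekend.map({True: "Weekend", False: "Weekday"})

print(pd.DataFrame({"Date": dates, "Day of the Week": day_names, "Weekday-Weekend": labels}))
```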


In [9]:
# Create a table showing the category counts
category_counts = df["Category"].value_counts()
print(category_counts)

Category
Music      98
Food       90
Travel     86
Culture    85
Fashion    84
Fitness    81
Health     78
Family     77
Name: count, dtype: int64
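
Raw counts are easier to compare as shares of the total. The sketch below rebuilds the counts printed above as a Series and converts them to percentages; on the real DataFrame, `df["Category"].value_counts(normalize=True)` returns the same proportions directly.

```python
import pandas as pd

# Counts copied from the output above
category_counts = pd.Series(
    {"Music": 98, "Food": 90, "Travel": 86, "Culture": 85,
     "Fashion": 84, "Fitness": 81, "Health": 78, "Family": 77},
    name="count",
)

total_posts = int(category_counts.sum())
shares = (category_counts / total_posts * 100).round(1)

print(total_posts)  # 679
print(shares)
```

The shares range from roughly 11% to 14%, which is consistent with categories being drawn uniformly at random.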


In [10]:
# Describe the numeric values of the DataFrame
df.describe()

Unnamed: 0,ID,Date,Likes,Comments,Shares,Views,Hashtags,Followers,Change in Followers,Campaign Budget,Impressions,Clicks,Conversions,Engaement Rate,CTR,ROI
count,679.0,679,679.0,679.0,679.0,679.0,679.0,679.0,679.0,679.0,679.0,679.0,679.0,679.0,679.0,679.0
mean,502.88218,2021-05-17 21:10:20.324005888,88306.56701,1009.622975,1895.764359,26858.539028,7.08542,1115673.0,1936.471281,25459.316642,78891.840943,3547.55081,636.687776,1.41334,0.059372,4.896736
min,0.0,2020-01-01 00:00:00,293.0,28.0,30.0,0.0,0.0,99320.0,-1000.0,1031.0,1779.0,26.0,1.0,0.018665,0.001197,0.003228
25%,263.5,2020-09-20 12:00:00,20971.5,304.0,519.5,0.0,4.0,636973.0,525.5,13231.0,33395.0,1299.5,204.5,0.481088,0.023993,0.867837
50%,504.0,2021-05-19 00:00:00,59197.0,764.0,1369.0,11489.0,7.0,1146824.0,1899.0,25276.0,75367.0,3032.0,472.0,0.96456,0.044115,2.18962
75%,750.5,2022-01-20 12:00:00,129637.0,1490.5,2770.5,44063.0,11.0,1594396.0,3350.5,37321.0,110992.5,5178.5,990.5,1.715122,0.069872,4.533758
max,999.0,2022-09-26 00:00:00,380618.0,3601.0,7641.0,183999.0,14.0,2077492.0,4995.0,49922.0,241289.0,12234.0,2309.0,21.472922,0.96537,142.942494
std,285.775467,,83833.193008,858.053255,1687.406712,35044.581437,4.108242,564318.7,1695.804137,13957.543549,50285.320988,2683.147456,524.065469,1.763325,0.070139,9.860801
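
In the summary above, the mean ROI (about 4.90) is more than twice the median (about 2.19) and the maximum is 142.94, which points to a long right tail. One common way to quantify this, which is an assumption here and not something the notebook itself does, is Tukey's rule: values above Q3 + 1.5 × IQR are flagged as potential outliers. The helper name is hypothetical.

```python
def iqr_upper_fence(q1, q3, k=1.5):
    """Upper outlier fence under Tukey's rule: Q3 + k * (Q3 - Q1)."""
    return q3 + k * (q3 - q1)


# ROI quartiles and maximum copied from the describe() output above
roi_q1, roi_q3, roi_max = 0.867837, 4.533758, 142.942494

fence = iqr_upper_fence(roi_q1, roi_q3)
print(round(fence, 3))   # 10.033
print(roi_max > fence)   # True
```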


In [11]:
df.to_csv("social_media_data.csv", index = False)
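
Passing `index = False` keeps the DataFrame index out of the file. If the index is written, reading the file back with default settings produces an extra `Unnamed: 0` column. The sketch below shows both cases on a tiny invented frame, using in-memory text in place of a file, and also shows `parse_dates` restoring the `Date` column's datetime type on reload.

```python
import io

import pandas as pd

demo = pd.DataFrame({
    "ID": [0, 1],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "Likes": [9468, 25892],
})

without_index = demo.to_csv(index=False)  # returns the CSV text when no path is given
with_index = demo.to_csv()                # the index is written by default

reloaded = pd.read_csv(io.StringIO(without_index), parse_dates=["Date"])
reloaded_with_index = pd.read_csv(io.StringIO(with_index))

print(reloaded.columns.tolist())             # ['ID', 'Date', 'Likes']
print(reloaded_with_index.columns.tolist())  # ['Unnamed: 0', 'ID', 'Date', 'Likes']
```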