In [None]:
df.to_csv('cleaned_social_media_engagement.csv', index=False)
print("Cleaned data saved to 'cleaned_social_media_engagement.csv'")

Cleaned data saved to 'cleaned_social_media_engagement.csv'


In [None]:
import pandas as pd

df = pd.read_csv('/content/Social Media Engagement Dataset.csv')
print("DataFrame loaded successfully. Displaying the first 5 rows:")
df.head()

DataFrame loaded successfully. Displaying the first 5 rows:


| | post_id | timestamp | day_of_week | platform | user_id | location | language | text_content | hashtags | mentions | ... | comments_count | impressions | engagement_rate | brand_name | product_name | campaign_name | campaign_phase | user_past_sentiment_avg | user_engagement_growth | buzz_change_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | kcqbs6hxybia | 2024-12-09 11:26:15 | Monday | Instagram | user_52nwb0a6 | Melbourne, Australia | pt | Just tried the Chromebook from Google. Best pu... | #Food | | ... | 701 | 18991 | 0.19319 | Google | Chromebook | BlackFriday | Launch | 0.0953 | -0.3672 | 19.1 |
| 1 | vkmervg4ioos | 2024-07-28 19:59:26 | Sunday | Twitter | user_ucryct98 | Tokyo, Japan | ru | Just saw an ad for Microsoft Surface Laptop du... | #MustHave, #Food | @CustomerService, @BrandCEO | ... | 359 | 52764 | 0.05086 | Microsoft | Surface Laptop | PowerRelease | Post-Launch | 0.1369 | -0.451 | -42.6 |
| 2 | memhx4o1x6yu | 2024-11-23 14:00:12 | Saturday | Reddit | user_7rrev126 | Beijing, China | ru | What's your opinion about Nike's Epic React? ... | #Promo, #Food, #Trending | | ... | 643 | 8887 | 0.45425 | Nike | Epic React | BlackFriday | Post-Launch | 0.2855 | -0.4112 | 17.4 |
| 3 | bhyo6piijqt9 | 2024-09-16 04:35:25 | Monday | YouTube | user_4mxuq0ax | Lagos, Nigeria | en | Bummed out with my new Diet Pepsi from Pepsi! ... | #Reviews, #Sustainable | @StyleGuide, @BrandSupport | ... | 743 | 6696 | 0.42293 | Pepsi | Diet Pepsi | LaunchWave | Launch | -0.2094 | -0.0167 | -5.5 |
| 4 | c9dkiomowakt | 2024-09-05 21:03:01 | Thursday | Twitter | user_l1vpox2k | Berlin, Germany | hi | Just tried the Corolla from Toyota. Absolutely... | #Health, #Travel | @BrandSupport, @InfluencerName | ... | 703 | 47315 | 0.08773 | Toyota | Corolla | LocalTouchpoints | Launch | 0.6867 | 0.0807 | 38.8 |


## Handle Missing Values



**Reasoning**:
To identify columns with missing values and understand their extent, I will first inspect the DataFrame `df` and calculate the count and percentage of missing values for each column.



In [None]:
missing_values = df.isnull().sum()
missing_percentage = missing_values / len(df) * 100  # reuse the counts computed above

missing_info = pd.DataFrame({
    'Missing Count': missing_values,
    'Missing Percentage': missing_percentage
})

missing_info = missing_info[missing_info['Missing Count'] > 0].sort_values(by='Missing Percentage', ascending=False)

print("Columns with missing values and their percentages:")
print(missing_info)

Columns with missing values and their percentages:
          Missing Count  Missing Percentage
mentions           3941           32.841667
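
The same summary can be built from a single boolean mask, since the mean of a boolean column is the fraction of `True` values. The sketch below is illustrative only: `toy` is a made-up frame standing in for the real dataset, and the last lines simply re-derive the reported percentage from the counts shown above (3,941 of 12,000).

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "post_id": ["a", "b", "c", "d"],
    "mentions": ["@BrandSupport", None, None, "@StyleGuide"],
})

# One boolean mask yields both the count (sum) and the share (mean).
mask = toy.isna()
summary = pd.DataFrame({
    "Missing Count": mask.sum(),
    "Missing Percentage": mask.mean() * 100,
})
summary = summary[summary["Missing Count"] > 0]
print(summary)

# Sanity check on the figure reported in the output above.
reported_pct = 3941 / 12000 * 100
print(round(reported_pct, 2))
```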


**Reasoning**:
The 'mentions' column is missing in 3,941 of 12,000 rows (32.84%). Because it is a text column, mean or median imputation does not apply. Dropping the column would discard potentially useful information, and dropping the affected rows would remove nearly a third of the data. Filling the missing values with an empty string keeps every row and, on the assumption that a missing value means the post mentioned no one, records that absence explicitly.



In [None]:
df['mentions'] = df['mentions'].fillna('')
print("Missing values in 'mentions' column filled with empty strings.")
print(f"Number of missing values in 'mentions' after filling: {df['mentions'].isnull().sum()}")

Missing values in 'mentions' column filled with empty strings.
Number of missing values in 'mentions' after filling: 0
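
One caveat applies when the filled frame is written with `to_csv` and loaded again: by default `pd.read_csv` parses empty fields as `NaN`, so the empty strings would come back as missing values. The sketch below shows this on a made-up two-row frame (`toy` and the in-memory `buffer` are illustrative, not part of the notebook) and one possible way to avoid it with `keep_default_na=False`.

```python
import io
import pandas as pd

# Toy frame with one missing 'mentions' value, filled as in the cell above.
toy = pd.DataFrame({
    "post_id": ["a", "b"],
    "mentions": ["@BrandSupport", None],
})
toy["mentions"] = toy["mentions"].fillna("")

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing: the empty field is read back as NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer)
print(default_read["mentions"].isna().sum())

# keep_default_na=False stops pandas from treating empty fields as missing.
buffer.seek(0)
preserved_read = pd.read_csv(buffer, keep_default_na=False)
print(preserved_read["mentions"].isna().sum())
```

Note that `keep_default_na=False` applies to every column, so it may not suit files where other columns rely on the default missing-value markers.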


**Reasoning**:
To confirm that all missing values have been addressed after handling the 'mentions' column, I will check for any remaining missing values across the entire DataFrame.



In [None]:
missing_after_fill = df.isnull().sum()
missing_after_fill = missing_after_fill[missing_after_fill > 0]

if missing_after_fill.empty:
    print("No missing values remaining in the DataFrame.")
else:
    print("Remaining missing values after handling 'mentions':")
    print(missing_after_fill)

No missing values remaining in the DataFrame.


## Handle Duplicate Values



**Reasoning**:
To ensure data integrity, I will identify and count duplicate rows in the DataFrame. After assessing the number of duplicates, I will remove them to prevent any skewed analysis or redundancy. Finally, I will display the shape of the DataFrame to confirm the successful removal of duplicates and the final size of the dataset.

In [None]:
initial_rows = df.shape[0]
duplicate_rows = df.duplicated().sum()
print(f"Number of duplicate rows found: {duplicate_rows}")

df = df.drop_duplicates()

remaining_rows = df.shape[0]
print(f"Number of rows after removing duplicates: {remaining_rows}")
print(f"Number of rows removed: {initial_rows - remaining_rows}")

Number of duplicate rows found: 0
Number of rows after removing duplicates: 12000
Number of rows removed: 0
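
`df.duplicated()` with no arguments flags a row only when every column matches an earlier row. If `post_id` is meant to be a unique key (an assumption, not something the dataset documents here), a key-based check may catch repeats that a full-row check misses. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy frame: the two 'p2' rows share a post_id but differ in comments_count.
toy = pd.DataFrame({
    "post_id": ["p1", "p2", "p2", "p3"],
    "platform": ["Twitter", "Reddit", "Reddit", "YouTube"],
    "comments_count": [10, 20, 25, 30],
})

# Full-row comparison sees no duplicates because comments_count differs.
full_row_dupes = toy.duplicated().sum()

# Restricting the comparison to the key column flags the repeated post_id.
key_dupes = toy.duplicated(subset=["post_id"]).sum()
print(full_row_dupes, key_dupes)

# Keep the first row per post_id and rebuild a clean integer index.
deduped = toy.drop_duplicates(subset=["post_id"], keep="first").reset_index(drop=True)
print(deduped)
```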


## Summarize Cleaned Data



**Reasoning**:
To confirm the cleaning steps were successful, I will display the shape, data types, and a sample of the first few rows of the cleaned DataFrame.



In [None]:
print(f"Shape of the cleaned DataFrame: {df.shape}")
print("\nConcise summary of the DataFrame:")
df.info()
print("\nFirst 5 rows of the cleaned DataFrame:")
df.head()

Shape of the cleaned DataFrame: (12000, 28)

Concise summary of the DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 28 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   post_id                  12000 non-null  object 
 1   timestamp                12000 non-null  object 
 2   day_of_week              12000 non-null  object 
 3   platform                 12000 non-null  object 
 4   user_id                  12000 non-null  object 
 5   location                 12000 non-null  object 
 6   language                 12000 non-null  object 
 7   text_content             12000 non-null  object 
 8   hashtags                 12000 non-null  object 
 9   mentions                 12000 non-null  object 
 10  keywords                 12000 non-null  object 
 11  topic_category           12000 non-null  object 
 12  sentiment_score          12000 non-null  float64
 ...

| | post_id | timestamp | day_of_week | platform | user_id | location | language | text_content | hashtags | mentions | ... | comments_count | impressions | engagement_rate | brand_name | product_name | campaign_name | campaign_phase | user_past_sentiment_avg | user_engagement_growth | buzz_change_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | kcqbs6hxybia | 2024-12-09 11:26:15 | Monday | Instagram | user_52nwb0a6 | Melbourne, Australia | pt | Just tried the Chromebook from Google. Best pu... | #Food | | ... | 701 | 18991 | 0.19319 | Google | Chromebook | BlackFriday | Launch | 0.0953 | -0.3672 | 19.1 |
| 1 | vkmervg4ioos | 2024-07-28 19:59:26 | Sunday | Twitter | user_ucryct98 | Tokyo, Japan | ru | Just saw an ad for Microsoft Surface Laptop du... | #MustHave, #Food | @CustomerService, @BrandCEO | ... | 359 | 52764 | 0.05086 | Microsoft | Surface Laptop | PowerRelease | Post-Launch | 0.1369 | -0.451 | -42.6 |
| 2 | memhx4o1x6yu | 2024-11-23 14:00:12 | Saturday | Reddit | user_7rrev126 | Beijing, China | ru | What's your opinion about Nike's Epic React? ... | #Promo, #Food, #Trending | | ... | 643 | 8887 | 0.45425 | Nike | Epic React | BlackFriday | Post-Launch | 0.2855 | -0.4112 | 17.4 |
| 3 | bhyo6piijqt9 | 2024-09-16 04:35:25 | Monday | YouTube | user_4mxuq0ax | Lagos, Nigeria | en | Bummed out with my new Diet Pepsi from Pepsi! ... | #Reviews, #Sustainable | @StyleGuide, @BrandSupport | ... | 743 | 6696 | 0.42293 | Pepsi | Diet Pepsi | LaunchWave | Launch | -0.2094 | -0.0167 | -5.5 |
| 4 | c9dkiomowakt | 2024-09-05 21:03:01 | Thursday | Twitter | user_l1vpox2k | Berlin, Germany | hi | Just tried the Corolla from Toyota. Absolutely... | #Health, #Travel | @BrandSupport, @InfluencerName | ... | 703 | 47315 | 0.08773 | Toyota | Corolla | LocalTouchpoints | Launch | 0.6867 | 0.0807 | 38.8 |


## Summary:

### Data Analysis Key Findings
*   The initial dataset, 'Social Media Engagement Dataset.csv', was successfully loaded and contains 12,000 rows and 28 columns.
*   **Missing Values Handling**:
    *   Only one column, 'mentions', initially had missing values, totaling 3,941 entries, which represented 32.84% of its values.
    *   These missing values in the 'mentions' column were imputed with empty strings (`''`).
    *   After this process, the dataset was confirmed to have no remaining missing values across any column.
*   **Duplicate Values Handling**:
    *   A full-row check with `df.duplicated()` found 0 duplicate rows.
    *   No rows were removed during the duplicate handling step, so the dataset size is unchanged.
*   **State of the Dataset After Cleaning**:
    *   The cleaned dataset now contains 12,000 rows and 28 columns.
    *   All columns are complete, with 12,000 non-null values, indicating the successful handling of missing data.
    *   The dataset contains no fully duplicated rows.

### Insights or Next Steps
*   The dataset has no missing values and no fully duplicated rows, which removes two common data quality issues before further analysis. Other checks remain: for example, `timestamp` is still stored as `object` and would need converting to a datetime type before any time-based analysis.
*   The next step could involve exploratory data analysis (EDA) to understand distributions, relationships, and potential outliers among the various engagement metrics and categorical features.
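
As one possible first step toward EDA, the `timestamp` column (shown as `object` in the summary above) can be parsed into a datetime type and cross-checked against `day_of_week`. The sketch below uses three rows copied from the displayed sample; the `format` string is an assumption based on how those sample values look, and `sample`, `derived_day`, and `mismatches` are illustrative names.

```python
import pandas as pd

# Three rows copied from the sample output shown earlier.
sample = pd.DataFrame({
    "timestamp": [
        "2024-12-09 11:26:15",
        "2024-07-28 19:59:26",
        "2024-11-23 14:00:12",
    ],
    "day_of_week": ["Monday", "Sunday", "Saturday"],
})

# Parse the strings into datetime64 values (format assumed from the sample rows).
sample["timestamp"] = pd.to_datetime(sample["timestamp"], format="%Y-%m-%d %H:%M:%S")

# Derive the weekday name from the parsed timestamp and compare it
# with the day_of_week column supplied in the data.
derived_day = sample["timestamp"].dt.day_name()
mismatches = (derived_day != sample["day_of_week"]).sum()
print(sample.dtypes)
print(f"Rows where day_of_week disagrees with timestamp: {mismatches}")
```

On the full dataset, a nonzero mismatch count would suggest inspecting those rows before relying on either column.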
