# Exploratory Data Analysis (EDA)

### Checking the Number of Rows
- **Objective**: Determine the number of rows in both CSV files.
- **Reason**: To understand the volume of data collected in each file and, as a first completeness check, confirm that both files contain the same number of rows.

In [2]:
import pandas as pd

# Load the collected posts (File 1) and their sentiment-analysis results (File 2)
df1 = pd.read_csv('climate_change_posts_streaming.csv')
df2 = pd.read_csv('climate_change_sentiment_analysis_streaming.csv')

# Print each file name followed by its row count
print('climate_change_posts_streaming.csv')
print(len(df1))
print()
print('climate_change_sentiment_analysis_streaming.csv')
print(len(df2))

climate_change_posts_streaming.csv
3973

climate_change_sentiment_analysis_streaming.csv
3973
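
Matching row counts are only a first signal of completeness. As a complementary check, the minimal sketch below also counts missing values per column. It runs on a small in-memory frame so it is self-contained; the `Text` column and the `summarize` helper are hypothetical and not taken from the actual files.

```python
import pandas as pd

# Hypothetical stand-in for one of the CSV files; the 'Text' column is an assumption.
sample = pd.DataFrame({
    "ID": [1, 2, 3],
    "Text": ["a", None, "c"],
})

def summarize(df: pd.DataFrame) -> dict:
    """Return the row count and the number of missing values per column."""
    return {
        "rows": len(df),
        "missing": df.isna().sum().to_dict(),
    }

summary = summarize(sample)
print(summary)
```

The same helper could be applied to `df1` and `df2` after loading them to see whether any column contains gaps.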


### Checking for Duplication
- **Objective**: Identify duplicate entries in both files by checking for repeated `ID` values.
- **Reason**: Ensure data integrity and prevent redundancy during analysis.

In [3]:
df1 = pd.read_csv('climate_change_posts_streaming.csv')
df2 = pd.read_csv('climate_change_sentiment_analysis_streaming.csv')

# Count how many times each ID appears
duplicate_counts1 = df1['ID'].value_counts()
duplicate_counts2 = df2['ID'].value_counts()

# Filter IDs that appear more than once
duplicate_ids1 = duplicate_counts1[duplicate_counts1 > 1]
print('climate_change_posts_streaming.csv')
print(f"Duplicate IDs:\n{duplicate_ids1}")
print("")
duplicate_ids2 = duplicate_counts2[duplicate_counts2 > 1]
print('climate_change_sentiment_analysis_streaming.csv')
print(f"Duplicate IDs:\n{duplicate_ids2}")

climate_change_posts_streaming.csv
Duplicate IDs:
Series([], Name: count, dtype: int64)

climate_change_sentiment_analysis_streaming.csv
Duplicate IDs:
Series([], Name: count, dtype: int64)
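
The check above looks only at the `ID` column. If rows could also repeat with identical content, `DataFrame.duplicated` offers a complementary view. The following is a sketch on hypothetical sample data, where only the `ID` column name comes from the notebook:

```python
import pandas as pd

# Hypothetical sample; the 'Text' column and all values are assumptions.
sample = pd.DataFrame({
    "ID": [101, 102, 102, 103],
    "Text": ["x", "y", "y", "z"],
})

# Rows whose ID occurs more than once (keep=False marks every occurrence).
dup_by_id = sample[sample.duplicated(subset="ID", keep=False)]

# Number of rows identical in every column to an earlier row.
n_full_row_dups = int(sample.duplicated().sum())

print(dup_by_id)
print(n_full_row_dups)
```

With `keep=False`, every occurrence of a repeated ID is returned, which makes it easier to inspect the conflicting rows side by side.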


### Verifying Sentiment Analysis Application
- **Objective**: Confirm that all posts collected in **File 1** have been analyzed and saved in **File 2**.
- **Reason**: Ensure every collected post has undergone sentiment analysis and no data is missed.

In [4]:
import pandas as pd

# Load both CSV files
df1 = pd.read_csv('climate_change_posts_streaming.csv')
df2 = pd.read_csv('climate_change_sentiment_analysis_streaming.csv')

# Assuming the ID column in both CSV files is named 'ID'
# Adjust the column name if necessary
ids_df1 = df1['ID']  # IDs in the first CSV file
ids_df2 = df2['ID']  # IDs in the second CSV file

# Boolean mask: True where an ID from the first CSV also appears in the second
ids_in_both = ids_df1.isin(ids_df2)

# Report the result, listing any IDs that are missing from the second CSV
if ids_in_both.all():
    print("All IDs in the first CSV are included in the second CSV.")
else:
    missing_ids = ids_df1[~ids_in_both]
    print("The following IDs are missing from the second CSV:", missing_ids.tolist())

All IDs in the first CSV are included in the second CSV.
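
The check above runs in one direction only (File 1 into File 2). If the reverse direction also matters, an outer merge with `indicator=True` can report IDs found in only one of the files. This is a sketch on hypothetical data; the frame names `posts` and `sentiment` are illustrative stand-ins for `df1` and `df2`.

```python
import pandas as pd

# Hypothetical ID sets; in the notebook these would come from df1 and df2.
posts = pd.DataFrame({"ID": [1, 2, 3, 4]})
sentiment = pd.DataFrame({"ID": [2, 3, 4, 5]})

# The '_merge' column records whether each ID came from the left, right, or both.
merged = posts[["ID"]].merge(
    sentiment[["ID"]], on="ID", how="outer", indicator=True
)

only_in_posts = merged.loc[merged["_merge"] == "left_only", "ID"].tolist()
only_in_sentiment = merged.loc[merged["_merge"] == "right_only", "ID"].tolist()

print("Only in posts:", only_in_posts)
print("Only in sentiment:", only_in_sentiment)
```

Two empty lists would indicate that both files cover exactly the same set of IDs.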


### Extracting November 2024 Data
- **Objective**: Filter and isolate data from November 2024.
- **Reason**: Restrict the analysis to a single calendar month so that posts from that period can be examined in detail.

In [5]:
import pandas as pd

# File paths
file_path = 'climate_change_sentiment_analysis_streaming.csv'  # Input file
output_file = 'filtered_data_november_2024_streaming.csv'  # Output file

# Load the data
df = pd.read_csv(file_path)

# Ensure the Date column is in datetime format
df['Date'] = pd.to_datetime(df['Date'])

# Filter rows where the date is in November 2024
filtered_data = df[(df['Date'].dt.month == 11) & (df['Date'].dt.year == 2024)]

# Save the filtered data to a new file
filtered_data.to_csv(output_file, index=False)

print(f"Extracted {len(filtered_data)} rows from November 2024. Saved to {output_file}.")

Extracted 153 rows from November 2024. Saved to filtered_data_november_2024_streaming.csv.
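
If the `Date` column might contain malformed values, a slightly more defensive variant of the filter can be sketched as follows. It assumes timezone-naive timestamps and uses hypothetical sample rows. `errors="coerce"` turns unparseable entries into `NaT` instead of raising, and the half-open interval selects the whole month.

```python
import pandas as pd

# Hypothetical rows; only the 'ID' and 'Date' column names come from the notebook.
sample = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Date": [
        "2024-10-31 23:59:59",
        "2024-11-01 00:00:00",
        "2024-11-30 18:30:00",
        "not a date",
    ],
})

# Unparseable values become NaT rather than raising an exception.
parsed = pd.to_datetime(sample["Date"], errors="coerce")
n_unparsed = int(parsed.isna().sum())

# Half-open interval [2024-11-01, 2024-12-01) covers all of November 2024.
start = pd.Timestamp("2024-11-01")
end = pd.Timestamp("2024-12-01")
november = sample[(parsed >= start) & (parsed < end)]

print(f"{n_unparsed} unparseable date(s); {len(november)} row(s) in November 2024")
```

Reporting the number of unparseable dates makes it visible when rows are silently excluded from the filtered file. If the real timestamps carry a timezone, `start` and `end` would need the same timezone for the comparison to work.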
