# Data Cleaning and Analysis

This notebook cleans and analyzes the Bitcoin tweet dataset. It covers data loading, cleaning, exploratory data analysis, and visualization.
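
Because the loading step passes `on_bad_lines='skip'`, malformed rows are dropped without any message. The following is a minimal sketch of how one might check how many rows were skipped. The CSV text, the column names, and the variable names are all hypothetical, and the line count is only valid when no field contains an embedded newline, which may not hold for raw tweets.

```python
import io

import pandas as pd

# Hypothetical toy CSV: the second data line has more fields than the header.
csv_text = "id,text\n1,btc to the moon\n2,broken,row,here\n3,hodl\n"

# on_bad_lines='skip' (pandas >= 1.3) silently drops lines with too many fields.
loaded = pd.read_csv(io.StringIO(csv_text), on_bad_lines="skip")

# Compare the number of data lines in the file with the rows actually parsed.
# Assumption: no quoted field spans several lines.
n_data_lines = len(csv_text.strip().splitlines()) - 1  # exclude the header
n_skipped = n_data_lines - len(loaded)
print(f"parsed {len(loaded)} rows, skipped {n_skipped}")
```

Passing `on_bad_lines='warn'` instead would report each skipped line, which can be useful on a first pass over a new file.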

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options for pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Load the raw data
raw_data_path = '../data/raw/sample_200k-130k_duplicates.csv'
df = pd.read_csv(raw_data_path, on_bad_lines='skip')

# Display the first few rows of the dataset
df.head()

## Data Cleaning Steps

This section handles missing values, normalizes the tweet text, and removes duplicate rows.
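
As a rough illustration, the three steps can be bundled into one helper and tried on a toy frame. This is a sketch under assumptions, not the notebook's definitive pipeline: `clean_tweets` is a hypothetical name, the `id` and `text` columns are taken from the cell below, and the whitespace collapsing goes beyond the lowercasing used there.

```python
import pandas as pd


def clean_tweets(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; the 'id' and 'text' column names are assumed."""
    # 1. Drop rows that have no tweet text at all.
    out = frame.dropna(subset=["text"]).copy()
    # 2. Normalize text: lowercase, collapse whitespace runs, trim the ends.
    out["text"] = (
        out["text"].str.lower().str.replace(r"\s+", " ", regex=True).str.strip()
    )
    # 3. Keep the first occurrence of each tweet id.
    out = out.drop_duplicates(subset=["id"], keep="first")
    return out.reset_index(drop=True)


# Hypothetical toy data: one duplicated id and one missing text value.
toy = pd.DataFrame(
    {"id": [1, 1, 2, 3], "text": ["BTC  Up", "BTC  Up", None, " Hodl "]}
)
cleaned = clean_tweets(toy)
print(cleaned)
```

Returning a copy rather than using `inplace=True` leaves the raw frame available for comparing row counts before and after cleaning.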

In [None]:
# Check for missing values
# (print explicitly: a cell only displays its last expression)
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])

# Fill missing values or drop rows/columns as necessary
# Example: df.dropna(subset=['text'], inplace=True)

# Normalize text data (e.g., lowercasing)
df['text'] = df['text'].str.lower()

# Remove duplicates
df.drop_duplicates(subset=['id'], inplace=True)

# Display cleaned data info
df.info()

## Exploratory Data Analysis (EDA)

This section explores how the data is distributed, starting with the tweet sentiment labels.
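
A count plot is easier to read alongside the underlying numbers. The sketch below tabulates counts and shares per sentiment label. The label values (`positive`, `negative`, `neutral`) and the toy frame are assumptions for illustration, since the actual contents of the `sentiment` column are not shown here.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'sentiment' column.
toy = pd.DataFrame(
    {"sentiment": ["positive", "negative", "positive", "neutral"]}
)

# Absolute counts and relative shares per label.
counts = toy["sentiment"].value_counts()
shares = toy["sentiment"].value_counts(normalize=True).round(2)

# Both Series are indexed by label, so they align when combined.
summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```

A heavily skewed share column would suggest looking at class imbalance before any modeling.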

In [None]:
# Visualize the distribution of tweet sentiments
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='sentiment')
plt.title('Distribution of Tweet Sentiments')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()

# Additional visualizations can be added here

## Conclusion

This notebook has outlined the data cleaning and exploratory analysis processes for the Bitcoin tweet dataset. Further analysis and modeling can be conducted using the cleaned data.