# Exploratory Data Analysis

In this notebook, we perform exploratory data analysis (EDA) on the cleaned text data used for summarization. The goal is to measure the length of the text, identify its most frequent words, and use those findings to prepare for further analysis.

In [1]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
sns.set_theme(style='whitegrid')

In [2]:
# Load the cleaned data
cleaned_data_path = '../data/processed/cleaned_data.txt'
with open(cleaned_data_path, 'r', encoding='utf-8') as file:
    cleaned_data = file.read()

# Convert the cleaned data into a DataFrame
data = pd.DataFrame({'text': [cleaned_data]})

In [3]:
# Check the shape of the dataset
data.shape

(1, 1)

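The DataFrame contains a single row that holds the entire cleaned text. We will now compute the length of that text in characters and plot it. Note that with only one row, the histogram shows a single bar rather than a true distribution.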

In [4]:
# Calculate the length of the text in characters
data['text_length'] = data['text'].apply(len)

# Visualize the distribution of text length
plt.figure(figsize=(10, 6))
sns.histplot(data['text_length'], bins=30, kde=True)
plt.title('Distribution of Text Length')
plt.xlabel('Length of Text (characters)')
plt.ylabel('Frequency')
plt.show()
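
One way to get a meaningful length distribution out of a single document is to split it into smaller units before measuring. The sketch below assumes sentences are a useful unit and uses a naive regex splitter. `split_sentences` and `sample_text` are hypothetical names introduced for illustration and are not part of the original pipeline. Plotting is left out so the snippet runs without a display.

```python
import re
import statistics


def split_sentences(text):
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    # Discard any empty fragments
    return [p for p in parts if p]


# Hypothetical stand-in for the cleaned text loaded earlier (assumption)
sample_text = "Text summarization is hard. Short sentences help! Do long ones hurt?"

sentences = split_sentences(sample_text)
char_lengths = [len(s) for s in sentences]
words_per_sentence = [len(s.split()) for s in sentences]

print(f"sentences: {len(sentences)}")
print(f"mean length (chars): {statistics.mean(char_lengths):.1f}")
print(f"median words per sentence: {statistics.median(words_per_sentence)}")
```

A list such as `char_lengths` can be passed to `sns.histplot` in place of the single-value `text_length` column. The regex splitter will mis-split abbreviations like "e.g.", so a dedicated sentence tokenizer may be a better fit for real data.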

Next, we will count the most frequent words in the text to get a first impression of its content.

In [5]:
# Import additional libraries for text processing
from collections import Counter
import re

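# Function to clean and tokenize text
def tokenize(text):
    """Remove non-letter characters, lowercase, and split on whitespace."""
    # Punctuation and digits are deleted, not replaced by a space,
    # so "well-known" becomes the single token "wellknown"
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    tokens = text.lower().split()
    return tokens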

# Tokenize the text
tokens = tokenize(data['text'].iloc[0])

# Count the frequency of each word
word_counts = Counter(tokens)
most_common_words = word_counts.most_common(10)

# Create a DataFrame for visualization
common_words_df = pd.DataFrame(most_common_words, columns=['Word', 'Frequency'])

# Visualize the most common words
plt.figure(figsize=(10, 6))
sns.barplot(x='Frequency', y='Word', data=common_words_df)
plt.title('Most Common Words in Text')
plt.xlabel('Frequency')
plt.ylabel('Word')
plt.show()
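
Raw frequency counts of English text are usually dominated by function words such as "the" and "of", which say little about content. A minimal sketch of filtering them out before counting follows. `STOPWORDS` is a small hand-picked set chosen for illustration (an assumption, not a standard list), and `content_word_counts` and `sample_text` are hypothetical names that do not appear in the original notebook.

```python
import re
from collections import Counter

# Small illustrative stopword set (assumption); real data needs a fuller list
STOPWORDS = {'a', 'an', 'and', 'the', 'of', 'to', 'in', 'is', 'it', 'that', 'for', 'on'}


def tokenize(text):
    """Same approach as the notebook's tokenizer: keep letters, lowercase, split."""
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text.lower().split()


def content_word_counts(text, stopwords=STOPWORDS):
    """Count only the tokens that are not in the stopword set."""
    return Counter(t for t in tokenize(text) if t not in stopwords)


# Hypothetical stand-in for the cleaned text (assumption)
sample_text = "The model reads the document and the model writes a summary of the document."

counts = content_word_counts(sample_text)
print(counts.most_common(3))
```

The resulting `Counter` can be fed into the same `pd.DataFrame(..., columns=['Word', 'Frequency'])` and `sns.barplot` steps used above, so the chart highlights content words instead of function words.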

In this notebook, we have performed exploratory data analysis on the cleaned text data. We measured the length of the text and identified its most common words. This analysis will help inform the summarization methods we choose to implement.