Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.

EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task, and it provides a better understanding of data set variables and the relationships between them. It can also help determine whether the statistical techniques you are considering for data analysis are appropriate. Originally developed by American mathematician John Tukey in the 1970s, EDA techniques continue to be widely used in the data discovery process today.

Given its importance, let's explore our dataset and its texts.

# Import necessary libraries

In [None]:
import itertools
import collections
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Importing wordcloud for plotting word clouds
from wordcloud import WordCloud
# Importing spacy
import spacy
# Loading the small English model; the parser and NER are disabled because only lemmas and stop-word flags are used below
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Examine and clean the Textual Data

In [None]:
train_file = "../input/commonlitreadabilityprize/train.csv"

In [None]:
data = pd.read_csv(train_file)
data.head()

In [None]:
data['binned_target'] = pd.cut(data['target'], bins=10)
data['binned_target'].value_counts()

In [None]:
data['target'].plot.hist(bins=10, alpha=0.5)
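
To make the binning step concrete, here is a minimal sketch on made-up scores (the `toy` values are hypothetical and not taken from the CommonLit data). With an integer `bins`, `pd.cut` splits the observed range into equal-width intervals, so the per-bin counts show how unevenly the values are spread.

```python
import pandas as pd

# Hypothetical readability scores, for illustration only
toy = pd.Series([-3.0, -2.5, -1.0, -0.9, -0.5, 0.0, 0.2, 1.0])

# bins=4 -> four equal-width intervals spanning the min..max of the data
binned = pd.cut(toy, bins=4)

# sort_index() lists the intervals from lowest to highest
# instead of ordering them by count
bin_counts = binned.value_counts().sort_index()
print(bin_counts)
```

Sorting by interval rather than by count makes the table read like the histogram: left to right, from the lowest scores to the highest.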

In [None]:
# Lemmatization with stop-word and punctuation removal using spaCy
data['lemmatized'] = data['excerpt'].apply(
    lambda x: ' '.join(token.lemma_ for token in nlp(x)
                       if not token.is_punct and not token.is_stop))

# Vectorization

In [None]:
# Creating Bag of words vectors
cv=CountVectorizer(analyzer='word')
bow_vectors=cv.fit_transform(data['lemmatized'])
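
As a rough picture of what a bag-of-words matrix holds, the sketch below builds one by hand with the standard library. The `docs` list is a hypothetical toy corpus, and the tokenizer is a plain whitespace split, which is simpler than what `CountVectorizer` does (it also lowercases and, by default, skips single-character tokens).

```python
from collections import Counter

# Hypothetical, already-lemmatized toy documents
docs = ["cat sit mat", "dog sit log", "cat chase dog"]

# Vocabulary: every distinct token, sorted so the column order is stable
vocab = sorted({tok for doc in docs for tok in doc.split()})

# One row per document, one column per vocabulary term;
# each cell is how many times that term occurs in that document
bow = []
for doc in docs:
    counts = Counter(doc.split())
    bow.append([counts[term] for term in vocab])

print(vocab)
for row in bow:
    print(row)
```

Each row is mostly zeros even on this tiny corpus, which is why the real vectorizer returns a sparse matrix.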

In [None]:
# Visualizing our vectors
# For bag of words
pca = PCA(n_components=2)
# PCA needs a dense array; .toarray() returns an ndarray
# (.todense() returns np.matrix, which recent scikit-learn versions reject)
x_pca = pca.fit_transform(bow_vectors.toarray())
plt.figure(figsize=(8,6))
plt.scatter(x_pca[:,0],x_pca[:,1],c=data['target'],cmap='rainbow')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')

In [None]:
# Creating vectors using TF-IDF (min_df=5 drops terms that appear in fewer than 5 excerpts)
TFIDF_vectorizer = TfidfVectorizer(min_df=5)
tfidf_vectors = TFIDF_vectorizer.fit_transform(data['lemmatized'])
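
To see what the TF-IDF weights mean, here is a hand-rolled sketch using only the standard library. It assumes the smoothed formula scikit-learn documents for its default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The toy `docs` and the `min_df = 2` threshold are hypothetical, chosen so the filtering is visible on four documents.

```python
import math
from collections import Counter

# Hypothetical, already-lemmatized toy documents
docs = ["cat sit mat", "dog sit log", "cat chase dog", "sit sit cat"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
df = Counter(tok for toks in tokenized for tok in set(toks))

# Keep only terms that appear in at least min_df documents
min_df = 2
vocab = sorted(term for term, count in df.items() if count >= min_df)

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF weights of one document."""
    tf = Counter(tokens)
    raw = [tf[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm if norm else 0.0 for value in raw]

rows = [tfidf_row(toks) for toks in tokenized]
print(vocab)
for row in rows:
    print([round(value, 3) for value in row])
```

Terms that occur in fewer documents get a larger idf, so here `dog` is weighted above `cat` and `sit` for the same raw count.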

In [None]:
# Visualizing our vectors
# For TF-IDF
pca = PCA(n_components=2)
# PCA needs a dense array; .toarray() returns an ndarray
# (.todense() returns np.matrix, which recent scikit-learn versions reject)
x_pca = pca.fit_transform(tfidf_vectors.toarray())
plt.figure(figsize=(8,6))
plt.scatter(x_pca[:,0],x_pca[:,1],c=data['target'],cmap='rainbow')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
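
A two-component scatter plot is only as informative as the share of variance those two components capture, so it can be worth checking `explained_variance_ratio_` before reading much into the picture. The sketch below does this on random counts; the matrix `X` is a hypothetical stand-in for a dense document-term matrix, not the notebook's data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in for a dense document-term matrix: 50 docs x 20 terms
X = rng.poisson(lam=1.0, size=(50, 20)).astype(float)

pca = PCA(n_components=2)
coords = pca.fit_transform(X)

# Fraction of the total variance captured by each of the two components
ratios = pca.explained_variance_ratio_
print(ratios, ratios.sum())
```

If the two ratios sum to only a few percent, as is common for sparse text vectors, the lack of visible structure in the plot says little about whether the features relate to the target.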

# WordCloud

In [None]:
# WordCloud
wordcloud = WordCloud(width = 3000, 
                      height = 2000, 
                      random_state=1, 
                      background_color='black', 
                      colormap='Set2', 
                      collocations=False).generate(" ".join(list(data['lemmatized'])))

# Save image
wordcloud.to_file("wordcloud.png")

# plot the WordCloud image                       
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
  
plt.show()

# Calculate and Plot Word Frequency

In [None]:
# List of all words across texts
all_words = " ".join(data['lemmatized']).split()

# Create counter
counts = collections.Counter(all_words)

counts.most_common(15)

In [None]:
clean_texts = pd.DataFrame(counts.most_common(50),
                           columns=['words', 'count'])

clean_texts.head()

In [None]:
fig, ax = plt.subplots(figsize=(12, 12))

# Plot horizontal bar graph
clean_texts.sort_values(by='count').plot.barh(x='words',
                                              y='count',
                                              ax=ax,
                                              color="purple")

ax.set_title("Common Words Found in Excerpts after cleaning")

plt.show()
