# Exploratory Data Analysis

This notebook takes a quick look at the Stanford Sentiment Treebank dataset in the context of sentence classification.

**Attention:** this notebook needs a condensed CSV version of the dataset. To generate it, run `python prepare_data.py` in the same directory as the zip file downloadable from [here](https://nlp.stanford.edu/sentiment/code.html).

## Setup

In [1]:
from pathlib import Path

import altair as alt
import pandas as pd

alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In [2]:
data_path = Path("..") / "data" / "labelled_sentences.csv"
df = pd.read_csv(data_path)
# Sentence length is measured in characters, not tokens.
df["sentence_length"] = df["sentence"].str.len()
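
Before plotting, it can help to confirm that the CSV has the columns the cells below rely on (`sentence`, `label` and `split`). The following is a minimal sketch of such a check, not part of `prepare_data.py`; the helper name `check_frame` and the toy rows are hypothetical, used only so the example runs without the real file.

```python
import pandas as pd

# Columns referenced by the cells in this notebook.
EXPECTED_COLUMNS = {"sentence", "label", "split"}


def check_frame(frame: pd.DataFrame) -> pd.Series:
    """Raise if an expected column is absent; return missing-value counts per column."""
    missing = EXPECTED_COLUMNS - set(frame.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return frame[sorted(EXPECTED_COLUMNS)].isna().sum()


# Toy stand-in for `df`; sentences, labels and split names are invented for illustration.
toy = pd.DataFrame(
    {
        "sentence": ["A fine film .", "Dull ."],
        "label": [0.8, None],
        "split": ["train", "test"],
    }
)
null_counts = check_frame(toy)
```

On the real data the call would be `check_frame(df)`; a non-zero count for any column is worth investigating before reading the charts.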

## Distribution of lengths, labels and splits

In [3]:
(
    alt.Chart(df, width=700, title="Sentence length distribution")
    .mark_bar()
    .encode(
        x=alt.X("sentence_length", bin=alt.Bin(maxbins=100), title="Sentence length"),
        y=alt.Y("count()", title="Number of elements")
    )
)

In [4]:
(
    alt.Chart(df, width=700, title="Sentiment distribution")
    .mark_bar()
    .encode(
        x=alt.X("label", bin=alt.Bin(maxbins=30), title="Sentiment"),
        y=alt.Y("count()", title="Number of elements")
    )
)

In [5]:
(
    alt.Chart(df, width=700, title="Dataset split count")
    .mark_bar()
    .encode(
        x=alt.X("split:N", title="Split"),
        y=alt.Y("count()", title="Number of elements")
    )
)
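
The three charts above can be complemented by a small numeric table. The sketch below groups by `split` and reports the row count, mean label and mean sentence length; the helper name `summarise_by_split` and the toy rows are assumptions made for illustration, and the real split names depend on what `prepare_data.py` writes.

```python
import pandas as pd


def summarise_by_split(frame: pd.DataFrame) -> pd.DataFrame:
    """Return row count, mean label and mean sentence length for each split."""
    return frame.groupby("split").agg(
        n_sentences=("sentence", "size"),
        mean_label=("label", "mean"),
        mean_length=("sentence_length", "mean"),
    )


# Toy stand-in for `df`; the split names and values are invented for illustration.
toy = pd.DataFrame(
    {
        "sentence": ["A fine film .", "Dull .", "Great cast ."],
        "label": [0.8, 0.2, 0.6],
        "split": ["train", "test", "train"],
    }
)
toy["sentence_length"] = toy["sentence"].str.len()
summary = summarise_by_split(toy)
```

With the real data, `summarise_by_split(df)` gives a quick way to see whether the splits differ noticeably in label or length distribution.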

## Other visualisations

In [6]:
scatter_plot = (
    alt.Chart(df, width=700, height=700, title="Scatter plot of sentiment vs. sentence length")
    .mark_point(radius=0.5, filled=True)
    .encode(
        x=alt.X("sentence_length", title="Sentence length"),
        y=alt.Y("label", title="Sentiment"),
    )
)

# Overlay a regression line of sentiment on sentence length.
regression_line = (
    scatter_plot
    .transform_regression("sentence_length", "label")
    .mark_line(strokeWidth=5)
    .encode(color=alt.value("#ff7f0e"))
)

scatter_plot + regression_line
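
The regression line shows the direction of the trend but not how strong it is. One way to put a number on it, sketched here under the assumption that a linear relationship is the quantity of interest, is the Pearson correlation between `sentence_length` and `label`. The helper name `length_label_correlation` and the toy values are hypothetical.

```python
import pandas as pd


def length_label_correlation(frame: pd.DataFrame) -> float:
    """Pearson correlation between sentence length and sentiment label."""
    return float(frame["sentence_length"].corr(frame["label"]))


# Toy stand-in for `df`; the values are invented for illustration.
toy = pd.DataFrame(
    {
        "sentence_length": [10, 20, 30, 40],
        "label": [0.2, 0.4, 0.5, 0.9],
    }
)
r = length_label_correlation(toy)
```

On the real data, `length_label_correlation(df)` close to zero would suggest that sentence length alone carries little linear information about sentiment, whatever the slope of the fitted line looks like.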