# Exploratory data analysis

This notebook explores the metadata: the product portfolio per disease, missing values for treatment, antibody and rating, and the distribution of the ratings.

In [None]:
import pandas as pd
from pathlib import Path

In [None]:
processed_data_path = Path("../data_preprocessing/data/preprocessed.csv")
df = pd.read_csv(processed_data_path)

In [None]:
df.head()

## Missing data

In [None]:
print("Number of comments where the treatment is missing:", df.treatment.isna().sum())

Let's look at how the comments with missing treatment information are distributed across diseases.

In [None]:
df.loc[df.treatment.isna()].groupby("disease").disease.count()

In [None]:
df.loc[df.treatment.isna(), ["text_index", "comment"]]

In [None]:
print("Number of missing ratings:", df.rate.isna().sum())

Let's look at how the comments with a missing rating are distributed across diseases.

In [None]:
df.loc[df.rate.isna()].groupby("disease").disease.count()
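
The counts above are absolute numbers, so a disease with more comments will naturally show more missing values. Below is a minimal sketch of how the share of missing values per disease could be computed instead. The helper `missing_share_by` and the toy frame are hypothetical and only illustrate the idea; in this notebook one would pass `df` with `"treatment"` or `"rate"` as the column.

```python
import pandas as pd


def missing_share_by(frame: pd.DataFrame, column: str, by: str) -> pd.Series:
    """Return, for each group in `by`, the fraction of rows where `column` is NaN."""
    # isna() gives a boolean Series; its mean per group is the missing share.
    return frame[column].isna().groupby(frame[by]).mean()


# Toy data for illustration only (not taken from preprocessed.csv).
toy = pd.DataFrame(
    {
        "disease": [
            "Crohn's Disease",
            "Crohn's Disease",
            "Ulcerative Colitis",
            "Ulcerative Colitis",
        ],
        "treatment": ["A", None, None, None],
    }
)

shares = missing_share_by(toy, "treatment", "disease")
print(shares)
```

With the real data, `missing_share_by(df, "rate", "disease")` would give the corresponding shares for the rating.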

## Product portfolio

In [None]:
df.disease.value_counts()

In [None]:
df.disease.value_counts().plot(kind="bar");

In [None]:
df.antibody.value_counts()

In [None]:
df.antibody.value_counts().plot(kind="bar");

In [None]:
df.treatment.value_counts()

In [None]:
df.treatment.value_counts().plot(kind="bar");

In [None]:
df.groupby(["disease", "antibody"]).treatment.value_counts()

In [None]:
df.groupby("disease").antibody.value_counts().unstack(0).plot.barh(
    title="Nr of patients per antibody for each disease"
);

In [None]:
df.groupby("disease").treatment.value_counts().unstack(0).plot.barh(
    title="Nr of patients per treatment for each disease"
);

## Rating

In [None]:
df.groupby(["disease", "antibody", "treatment"]).rate.mean()

In [None]:
df.groupby(["disease", "antibody"]).rate.mean().unstack(0).plot(
    kind="bar", title="Avg. rate per antibody for each disease"
);

In [None]:
df.groupby(["disease", "treatment"]).rate.mean().unstack(0).plot(
    kind="bar", title="Avg. rate per treatment for each disease"
);

In [None]:
idx_rating = sorted(df.rate.dropna().unique())

In [None]:
df.rate.value_counts()[idx_rating].plot(
    kind="bar", title="Distribution of ratings for all data"
);

In [None]:
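# reindex keeps every rating on the x-axis, even if it never occurs for this disease
df.loc[df.disease == "Crohn's Disease"].rate.value_counts().reindex(
    idx_rating, fill_value=0
).plot(kind="bar", title="Distribution of ratings for Crohn's Disease");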

In [None]:
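# reindex keeps every rating on the x-axis, even if it never occurs for this disease
df.loc[df.disease == "Ulcerative Colitis"].rate.value_counts().reindex(
    idx_rating, fill_value=0
).plot(kind="bar", title="Distribution of ratings for Ulcerative Colitis");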
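
The per-disease rating distributions can also be put side by side in one table, which makes them comparable even when the diseases have different numbers of comments. This is a minimal sketch, assuming the `disease` and `rate` columns used above. The toy frame is made up for illustration; in this notebook one would use `df` instead.

```python
import pandas as pd

# Toy data for illustration only (not taken from preprocessed.csv).
toy = pd.DataFrame(
    {
        "disease": [
            "Crohn's Disease",
            "Crohn's Disease",
            "Crohn's Disease",
            "Ulcerative Colitis",
            "Ulcerative Colitis",
        ],
        "rate": [1, 5, 5, 3, None],
    }
)

# Drop comments without a rating so they do not affect the shares.
rated = toy.dropna(subset=["rate"])

# Rows: diseases, columns: ratings, values: share of each rating within a disease.
rating_share = pd.crosstab(rated["disease"], rated["rate"], normalize="index")
print(rating_share)
```

Calling `rating_share.T.plot(kind="bar")` would then draw the diseases next to each other for every rating.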

## Treatment type

In [None]:
# The mean of the boolean mask is the share of missing values
print("Share of NaN treatment types:", round(df.treatment_type.isna().mean(), 2))

In [None]:
df.groupby(["disease", "antibody", "treatment"]).treatment_type.value_counts(
    dropna=False
)