# Setup

Importing modules.

In [None]:
import pandas as pd
import numpy as np
from itables import init_notebook_mode
import missingno as msno
import seaborn as sns

Loading in data from csv.

In [None]:
DF_ORIGINAL = pd.read_csv("../data/manga.csv")

Filling all null tag and genre features with zero, as that is their implicit value.

In [None]:
NON_ZEROABLE_COLUMN_NAMES = ["id", "chapters", "volumes", "start_year", "start_month",
                             "start_day", "end_year", "end_month", "end_day"]
ZEROABLE_NUMERIC_COLUMN_NAMES = (DF_ORIGINAL.drop(NON_ZEROABLE_COLUMN_NAMES, axis=1)
                                 .select_dtypes(include=["number"]).columns.tolist())
DEFAULT_ZEROES = [0] * len(ZEROABLE_NUMERIC_COLUMN_NAMES)
NULL_MAP = dict(zip(ZEROABLE_NUMERIC_COLUMN_NAMES, DEFAULT_ZEROES))
ZEROED_DF = DF_ORIGINAL.fillna(value=NULL_MAP)
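
As a small illustration of the per-column fill used above, the sketch below applies `fillna` with a dictionary to a toy frame. The column values are made up and `zeroable` is a hypothetical stand-in for the real list of tag and genre columns; the point is only that columns not named in the dictionary keep their nulls.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the manga data; column names mirror the notebook, values are made up.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "chapters": [10.0, np.nan, 45.0],
    "Romance": [0.8, np.nan, 0.1],
    "Shounen": [np.nan, 0.9, np.nan],
})

# Hypothetical list of tag/genre columns whose nulls implicitly mean zero.
zeroable = ["Romance", "Shounen"]

# A dict passed to fillna fills only the named columns; "chapters" keeps its NaN.
filled = toy.fillna(value=dict.fromkeys(zeroable, 0))

print(filled)
```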

Computing target feature from data.

In [None]:
SHOUNEN_TAG_PCT = ZEROED_DF["Shounen"]
SHOUJO_TAG_PCT = ZEROED_DF["Shoujo"]
SEINEN_TAG_PCT = ZEROED_DF["Seinen"]
JOSEI_TAG_PCT = ZEROED_DF["Josei"]

demo_col = []

for x in range(ZEROED_DF.shape[0]):
    demo_dict = {"Shounen": SHOUNEN_TAG_PCT[x],
                 "Shoujo": SHOUJO_TAG_PCT[x],
                 "Seinen": SEINEN_TAG_PCT[x],
                 "Josei": JOSEI_TAG_PCT[x]}

    # Label each manga with the demographic tag that has the highest percentage;
    # ties go to the alphabetically last tag name. Manga with no demographic
    # tag at all get a missing label.
    if sum(demo_dict.values()) > 0:
        max_vk = max((v, k) for (k, v) in demo_dict.items())
        demo_label = max_vk[1]
    else:
        demo_label = None

    demo_col.append(demo_label)
    
DEMO_ADDED_DF = (ZEROED_DF
                 .assign(demo_label = pd.Series(demo_col))
                 .drop(["Shounen","Shoujo","Seinen","Josei"], axis=1))
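
The row-by-row loop above could probably also be written without an explicit loop. The following is a minimal sketch on made-up data, not the method used in this notebook: it relies on `idxmax` returning the first column that holds the row maximum, so listing the columns as Shounen, Shoujo, Seinen, Josei should reproduce the loop's tie-breaking (alphabetically last name wins). The toy values and the names `demo_cols`, `top_tag`, and `has_tag` are assumptions for illustration.

```python
import pandas as pd

# Toy tag percentages (made up); the column order matters for tie-breaking.
demo_cols = ["Shounen", "Shoujo", "Seinen", "Josei"]
toy = pd.DataFrame(
    [[80, 0, 20, 0],    # clear Shounen majority
     [0, 0, 0, 0],      # no demographic tag at all
     [0, 60, 0, 60]],   # tie between Shoujo and Josei
    columns=demo_cols,
)

# idxmax returns the first column holding the row maximum.
top_tag = toy[demo_cols].idxmax(axis=1)

# Rows with no demographic tag get a missing label instead of an arbitrary one.
has_tag = toy[demo_cols].sum(axis=1) > 0
demo_label = top_tag.where(has_tag)

print(demo_label.tolist())
```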

Separating out the "usable" data, i.e. the data with non-missing labels.

In [None]:
usable_df = DEMO_ADDED_DF.dropna(subset=["demo_label"])

# Full-dataset summaries

Viewing a few records of data frame.

In [None]:
# Display data frames interactively
init_notebook_mode(all_interactive=True)

usable_df.head()

In [None]:
usable_df.tail()

Checking shape of data.

In [None]:
usable_df.shape

Checking summary statistics on numeric features, excluding the media ID.

In [None]:
# Summary statistics for numeric columns (excluding the ID), plus an explicit
# percentage of null values computed on the same usable data.
numeric_usable_df = usable_df.drop("id", axis=1).select_dtypes(include=np.number)
summary_df = numeric_usable_df.describe()
summary_df.loc["pct_null"] = [f"{pct:0.2%}" for pct in numeric_usable_df.isna().mean()]
summary_df = summary_df.transpose()
summary_df

Checking the central tendency of each feature, conditioned on our labels.

In [None]:
(usable_df
 .groupby("demo_label")
 .describe()
 .transpose()
 .reset_index()
 .rename(columns={"level_0": "feature", "level_1": "metric"})
 .rename_axis(None, axis=1)
 .query("metric in ['mean', '50%']"))

The results of the summaries by group are interesting, because they frequently fly in the face of received wisdom about the demographic labels. It reminds us that there is a non-response bias in the tag completion rate which we may need to account for during feature extraction.

Checking summary statistics for categorical features as well, excluding names of media.

In [None]:
def cat_summary_frame(df, colname, naincl = True):

    """
    Within the dataframe passed as the first argument to this function,
    summarizes the count of each level of the feature whose name is
    passed as the second argument, as well as the percentage of total
    observations each count represents. Missing values are counted as
    their own level unless naincl is False.
    """

    summ_df = df.groupby(colname, dropna = not naincl).size().to_frame().rename(columns={0: "count"})
    summ_df = summ_df.assign(pct = round((summ_df["count"] / summ_df["count"].sum()) * 100, 2))
    summ_df = summ_df.assign(pct = summ_df["pct"].astype("string"))
    summ_df = summ_df.assign(pct = summ_df["pct"] + "%")
    return summ_df.sort_values("count", ascending=False).reset_index()

cat_summary_frame(usable_df, "status")

In [None]:
cat_summary_frame(usable_df, "source")

In [None]:
cat_summary_frame(usable_df, "country")

In [None]:
cat_summary_frame(usable_df, "demo_label")
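
For comparison, a count-and-percentage table of the same shape can be sketched with `value_counts`. This is only an alternative illustration on made-up status values, not a replacement for `cat_summary_frame`; here missing values are dropped before counting.

```python
import pandas as pd

# Made-up status values, including one missing entry.
toy = pd.DataFrame({"status": ["FINISHED", "RELEASING", "FINISHED", None, "FINISHED"]})

# Counts per level, most frequent first; missing values are excluded here.
counts = toy["status"].value_counts(dropna=True)

summary = pd.DataFrame({
    "count": counts,
    "pct": (counts / counts.sum() * 100).round(2).astype("string") + "%",
})
summary = summary.rename_axis("status").reset_index()

print(summary)
```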

# Examining Missingness

Using visualizations to identify patterns in missingness.

In [None]:
features_with_missingness = usable_df.columns[usable_df.isna().any(axis=0)].tolist()
df_for_missing_dendogram = usable_df[features_with_missingness]
df_for_missing_matrix = df_for_missing_dendogram.sample(200, random_state=1234)
msno.matrix(df_for_missing_matrix)

In [None]:
msno.heatmap(usable_df)

In [None]:
msno.dendrogram(df_for_missing_dendogram)

Through these visualizations we can see that there are two main groupings of features whose missingness is relatively highly correlated:
 - start_month, start_day, start_year, and source
 - end_year, end_month, end_day, chapters, and volumes
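
The groupings above come from nullity correlation, that is, how strongly the presence or absence of one feature tracks that of another. A minimal sketch of that quantity on made-up data, assuming plain pandas is enough for illustration, is to correlate the missingness indicators directly:

```python
import numpy as np
import pandas as pd

# Made-up rows: end_year and chapters are missing together,
# while start_day is missing independently of both.
toy = pd.DataFrame({
    "end_year": [2010, np.nan, np.nan, 2015, np.nan, 2020],
    "chapters": [12, np.nan, np.nan, 30, np.nan, 8],
    "start_day": [1, 5, np.nan, np.nan, 9, 3],
})

# 1 where a value is missing, 0 otherwise; correlating these indicators
# gives the nullity correlation between each pair of features.
nullity = toy.isna().astype(int)
nullity_corr = nullity.corr()

print(nullity_corr.round(2))
```

In this toy frame the two features that are always missing together have a nullity correlation of 1, which is the kind of pattern the heatmap and dendrogram group together.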

I would hypothesize that much of the missingness in the first group is negatively correlated with a manga's popularity: I suspect that manga with smaller fanbases have fewer people willing to research or contribute the data needed to complete the start date, status, or source.

I would also hypothesize that the missingness of the second group is correlated with the status of a manga, because oftentimes the convention on the site is to leave things like the end date and number of chapters and volumes blank until the work has reached its conclusion.

To investigate these hypotheses, we will display a correlation heatmap with the missingness indicators on one axis and, on the other, every feature whose correlation with at least one indicator has an absolute value of at least 0.2.

In [None]:
missing_indicator_df = df_for_missing_dendogram.isna().add_prefix("MISSING_")
dummy_df = pd.get_dummies(usable_df.drop(["eng_title","rom_title","id"], axis="columns"))
missingness_corr = pd.concat([missing_indicator_df, dummy_df], axis="columns").corr()[missing_indicator_df.columns.tolist()]
is_high_corr = missingness_corr.abs().ge(.2).any(axis="columns")
high_corr_features = [name for name in missingness_corr.index[is_high_corr] if not name.startswith("MISSING_")]
missingness_corr_2 = missingness_corr.filter(items=high_corr_features, axis=0)
sns.heatmap(missingness_corr_2, cmap="bwr", center=0)

##### Group 1
My hypothesis about the first group can likely be rejected: missingness in the start month and day seems to be moderately negatively correlated with start_year and end_year, which likely indicates that older works are more likely than newer works to have missing data in these features. Missingness in start_year and source did not clearly correlate with any feature.

##### Group 2
We likely cannot reject my hypothesis about the second group, as missingness in chapters, volumes, and end dates is highly positively correlated with a status of "releasing" and highly negatively correlated with a status of "finished". This gives us valuable insight into how we might handle missing values in these features in the future.

##### Other Features
Missingness in eng_title seems to be highly correlated with a number of features that I believe are likely to scale along with popularity. We will explore this further later.

##### Further Research
Future investigation into the mechanism of missingness may involve additional sampling from outside AniList to find the characteristics of the missing data and whether they differ from the data we have access to. We could also perform Little's MCAR test to check whether the data is missing completely at random. Finally, if it seems likely that the data is missing at random, we could validate this by checking the characteristics of missingness within defined groups.
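
As a rough sketch of what checking missingness within defined groups could look like, the example below computes the share of missing values per feature within each status. The data is made up and the lowercase status values are an assumption; the real column may use different labels.

```python
import numpy as np
import pandas as pd

# Made-up records; status values are illustrative only.
toy = pd.DataFrame({
    "status": ["finished", "finished", "releasing", "releasing", "releasing"],
    "chapters": [120, 45, np.nan, np.nan, 10],
    "end_year": [2012, np.nan, np.nan, np.nan, np.nan],
})

# Share of missing values per feature within each status group.
missing_by_status = (
    toy.drop(columns="status")
       .isna()
       .groupby(toy["status"])
       .mean()
)

print(missing_by_status)
```

If missingness rates differ sharply between groups but look stable within them, that would be consistent with data that is missing at random conditional on the grouping feature, though it would not prove it.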

# Correlation

Next, we will explore correlation in our data, both between pairs of features and between each feature and our label.

In [None]:
correlation = dummy_df.corr()
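# Keep only the strict upper triangle so each pair of features appears once
# and self-correlations on the diagonal are dropped.
upper_corr = correlation.where(
    np.triu(np.ones(correlation.shape), k=1).astype(bool))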
unstacked_corr = upper_corr.unstack().dropna()
unstacked_corr_2 = unstacked_corr
unstacked_corr_2.index = unstacked_corr.index.map(' & '.join)
sorted_unstacked_corr = unstacked_corr_2.reindex(unstacked_corr_2.abs().sort_values(ascending=False).index)

def highest_correlation(data, threshold, regex=".*"):
    """
    Takes the series of correlations passed to this function and returns
    the pairs whose absolute correlation exceeds the threshold and whose
    pair label matches the regular expression passed in regex.
    """

    return data.filter(regex=regex).loc[lambda x: (x > threshold) | (x < -threshold)]

highest_correlation(data=sorted_unstacked_corr, threshold=0.75)
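
To make the pair-extraction step easier to follow, here is a small self-contained sketch of the same idea on synthetic data. The columns `a` to `d` are invented, and `d` is constructed to be perfectly correlated with `a` so that the top-ranked pair is known in advance.

```python
import numpy as np
import pandas as pd

# Synthetic data: three independent columns plus one derived column.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
toy["d"] = toy["a"] * 2 + 1  # perfectly correlated with "a" by construction

corr = toy.corr()

# k=1 excludes the diagonal, so each unordered pair is kept exactly once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).unstack().dropna()

# Flatten the two-level index into a single "x & y" label per pair.
pairs.index = pairs.index.map(" & ".join)

# Rank pairs by absolute correlation, strongest first.
pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index)

print(pairs)
```

With four columns there are six unordered pairs, and the pair made of `a` and `d` should come out on top.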

In [None]:
highest_correlation(data=sorted_unstacked_corr, threshold=0.15, regex="demo_label")

In [None]:
highest_correlation(data=sorted_unstacked_corr, threshold=0.05, regex="Josei")

In [None]:
highest_correlation(data=sorted_unstacked_corr, threshold=0.15, regex="Full Color")

The list of the most highly correlated pairs reveals some patterns that may be important for feature reduction and feature extraction later on:
 - There are a number of features with perfect correlation, likely due to the gender field on characters in one particular manga or franchise containing several unique values. 
 - Features that scale along with the popularity of the manga (such as the score bucket features, user status features, and favorites) tend to be highly correlated with each other, as you might expect. This means that these features provide little new information except in context with one another.
 - End year and start year are highly correlated.
 - The anime relation and the adaptation relation are highly correlated, likely indicating that if a manga is adapted into something else that shows up on the site, it will almost always be an anime.
 - Male background roles and female background roles both scale with total background roles. This seems to be true to a lesser extent with supporting and main roles as well. These features mainly provide additional information in context with one another, as with the popularity-scaling features.
 - Categorical indicators tend to be negatively correlated with one another, as we would expect.

Some interesting observations regarding correlations with our labels:
 - Shoujo and the romance genre, the heterosexual tag, and the male harem tag are positively correlated.
 - Shoujo and the action genre, the ecchi genre, and the male protagonist tag are negatively correlated.
 - Shounen and the action genre, the recorded size of the supporting cast, the recorded size of the male main cast, the adventure genre, the male protagonist tag, and the number of related manga are positively correlated.
 - Shounen and the romance genre are negatively correlated.
 - Seinen, the nudity tag, and the ecchi genre are positively correlated.
 - Seinen and the romance genre are negatively correlated.
 - Josei and the primarily adult cast tag are positively correlated.