# Setup

Importing modules.

In [None]:
import pandas as pd
import numpy as np
from itables import init_notebook_mode
import missingno as msno
import seaborn as sns

Loading in data from csv.

In [None]:
DF_ORIGINAL = pd.read_csv("../data/manga.csv")

Filling all null tag and genre features with zero, as that is their implicit value.

In [None]:
ZEROABLE_NUMERIC_COLUMN_NAMES = list(
    DF_ORIGINAL
    .drop(["id", "chapters", "volumes",
           "start_year", "start_month", "start_day",
           "end_year", "end_month", "end_day"], axis=1)
    .select_dtypes(include=['number'])
    .columns.values
)
DEFAULT_ZEROES = [0] * len(ZEROABLE_NUMERIC_COLUMN_NAMES)
NULL_MAP = dict(zip(ZEROABLE_NUMERIC_COLUMN_NAMES, DEFAULT_ZEROES))
ZEROED_DF = DF_ORIGINAL.fillna(value=NULL_MAP)
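
As a small illustration of the dictionary-based fill above, the sketch below uses a made-up two-column frame (the values and the choice of `Action` as a stand-in tag column are assumptions, not taken from the dataset). It shows that `fillna` only touches the columns named in the mapping, so features like `chapters` keep their nulls.

```python
import numpy as np
import pandas as pd

# Toy frame: "Action" stands in for a tag/genre column, "chapters" for a
# column where a null means "unknown" rather than zero. Values are made up.
toy_df = pd.DataFrame({
    "Action": [80.0, np.nan, 55.0],
    "chapters": [12.0, np.nan, np.nan],
})

# Only the columns named in the mapping are filled; the rest are left alone.
toy_filled = toy_df.fillna(value={"Action": 0})

print(toy_filled)
```

This is why the ID, count, and date columns are dropped before building the mapping: leaving them out of the dictionary is what protects their nulls.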

Computing target feature from data.

In [None]:
SHOUNEN_TAG_PCT = ZEROED_DF["Shounen"]
SHOUJO_TAG_PCT = ZEROED_DF["Shoujo"]
SEINEN_TAG_PCT = ZEROED_DF["Seinen"]
JOSEI_TAG_PCT = ZEROED_DF["Josei"]

demo_col = []

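for x in range(ZEROED_DF.shape[0]):
    # Tag percentages for this record, keyed by demographic
    demo_dict = {"Shounen": SHOUNEN_TAG_PCT[x],
                 "Shoujo": SHOUJO_TAG_PCT[x],
                 "Seinen": SEINEN_TAG_PCT[x],
                 "Josei": JOSEI_TAG_PCT[x]}

    if sum(demo_dict.values()) > 0:
        # Label with the highest-percentage tag; ties go to the alphabetically last name
        max_vk = max((v, k) for (k, v) in demo_dict.items())
        demo_label = max_vk[1]
    else:
        # No demographic tag at all: leave the label missing
        demo_label = None

    demo_col.append(demo_label)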
    
DEMO_ADDED_DF = (ZEROED_DF
                 .assign(demo_label = pd.Series(demo_col))
                 .drop(["Shounen","Shoujo","Seinen","Josei"], axis=1))
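
The row-by-row loop above could probably also be written in a vectorized way. The sketch below is one possible alternative on a made-up frame (the percentages are invented for illustration), using `idxmax` to pick the column with the largest value in each row. One difference to be aware of: `idxmax` breaks ties by taking the first column in order, whereas the loop's tuple comparison takes the alphabetically last name.

```python
import pandas as pd

# Toy stand-in for the four demographic tag columns (values are made up).
toy_tags = pd.DataFrame({
    "Shounen": [90, 0, 10, 0],
    "Shoujo":  [0, 0, 75, 0],
    "Seinen":  [40, 0, 0, 60],
    "Josei":   [0, 0, 0, 20],
})

# idxmax(axis=1) returns the name of the column holding each row's maximum.
toy_label = toy_tags.idxmax(axis=1)

# Rows where every tag is zero get a missing label, mirroring the loop's
# `None` branch.
toy_label = toy_label.where(toy_tags.sum(axis=1) > 0)

print(toy_label.tolist())
```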

# Full-dataset summaries

Viewing a few records of data frame.

In [None]:
DEMO_ADDED_DF.head()

In [None]:
DEMO_ADDED_DF.tail()

Checking shape of data.

In [None]:
DEMO_ADDED_DF.shape

Checking summary statistics on numeric features, excluding the media ID.

In [None]:
# Display data frames interactively
init_notebook_mode(all_interactive=True)

# Computing summary statistics for numeric columns, excluding the ID, adding in an explicit percentage of null values.
summary_df = DEMO_ADDED_DF.drop("id", axis=1).describe(include=np.number)
summary_df.loc["pct_null"] = [f"{pct:0.2%}" for pct in DEMO_ADDED_DF
                              .drop("id", axis=1)
                              .select_dtypes(include=np.number)
                              .isna().mean().tolist()]
summary_df = summary_df.transpose()
summary_df

Checking summary statistics for categorical features as well, excluding names of media.

In [None]:
def cat_summary_frame(df, colname, dropna=False):

    """
    Within the dataframe passed as the first argument to this function,
    summarizes the count of each level of the feature whose name is
    passed as the second argument, as well as the percentage of total
    observations each count represents. Missing values are counted as
    their own level unless the third argument, dropna, is True.
    """

    summ_df = df.groupby(colname, dropna=dropna).size().to_frame().rename(columns={0: "count"})
    summ_df = summ_df.assign(pct = round((summ_df["count"] / summ_df["count"].sum()) * 100, 2))
    summ_df = summ_df.assign(pct = summ_df["pct"].astype("string"))
    summ_df = summ_df.assign(pct = summ_df["pct"] + "%")
    return summ_df.sort_values("count", ascending=False).reset_index()

cat_summary_frame(DEMO_ADDED_DF, "status")

In [None]:
cat_summary_frame(DEMO_ADDED_DF, "source")

In [None]:
cat_summary_frame(DEMO_ADDED_DF, "country")

In [None]:
cat_summary_frame(DEMO_ADDED_DF, "demo_label")

In [None]:
# Summarizing demo_label again with missing values dropped, to see each label's share of the labeled records only.
cat_summary_frame(DEMO_ADDED_DF, "demo_label", True)
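
The effect of the third argument is easiest to see on a tiny example. The sketch below uses pandas' built-in `value_counts` on a made-up `status` column (the values are invented) rather than the helper above, to show how the percentages change depending on whether missing values are counted as their own level.

```python
import pandas as pd

# Made-up column with one missing value out of four records.
toy_status = pd.Series(["finished", "releasing", "finished", None], name="status")

# dropna=False keeps missing values as their own level, so shares are
# computed over all four records.
counts_with_na = toy_status.value_counts(dropna=False)
share_with_na = toy_status.value_counts(dropna=False, normalize=True)

# dropna=True (the default) computes shares over the three non-missing
# records only.
share_without_na = toy_status.value_counts(normalize=True)

print(share_with_na)
print(share_without_na)
```

With missing values included, "finished" accounts for half of the records; with them dropped, it accounts for two thirds.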

# Examining Missingness

Using visualizations to identify patterns in missingness.

In [None]:
# Names of every feature that has at least one missing value
features_with_missingness = DEMO_ADDED_DF.columns[DEMO_ADDED_DF.isna().any(axis=0)].tolist()
df_for_missing_dendogram = DEMO_ADDED_DF[features_with_missingness]
# Plot the matrix on a reproducible 200-row sample
df_for_missing_matrix = df_for_missing_dendogram.sample(200, random_state=1234)
msno.matrix(df_for_missing_matrix)

In [None]:
msno.heatmap(DEMO_ADDED_DF)

In [None]:
msno.dendrogram(df_for_missing_dendogram)

Through these visualizations we can see that there are two main groupings of features whose missingness is relatively highly correlated:
 - start_month, start_day, start_year, status, and source
 - end_year, end_month, end_day, chapters, and volumes

I would hypothesize that much of the missingness in the first group is negatively correlated with the level of a manga's popularity. The reason for this is that I suspect these are cases where manga with smaller fanbases have fewer people willing to research and/or contribute data to complete the start date, status, or source of that particular manga.

I would also hypothesize that the missingness of the second group is correlated with the status of a manga, because oftentimes the convention on the site is to leave things like the end date and number of chapters and volumes blank until the work has reached its conclusion.

To investigate these hypotheses, we will display a correlation heatmap with the missingness indicators on one axis and, on the other, every feature whose correlation with at least one indicator is at least 0.2 in absolute value.

In [None]:
# Indicator columns flagging where each feature with missingness is null
missing_indicator_df = df_for_missing_dendogram.isna().add_prefix("MISSING_")
# One-hot encode the categorical features, dropping titles and the ID
dummy_df = pd.get_dummies(DEMO_ADDED_DF.drop(["eng_title","rom_title","id"], axis="columns"))
# Correlation of every column with each missingness indicator
missingness_corr = pd.concat([missing_indicator_df, dummy_df], axis="columns").corr()[missing_indicator_df.columns.tolist()]
# Keep non-indicator features with |correlation| >= 0.2 against at least one indicator
high_corr_mask = missingness_corr.abs().ge(.2).any(axis="columns")
high_corr_features = [name for name in high_corr_mask[high_corr_mask].index if not name.startswith("MISSING_")]
missingness_corr_2 = missingness_corr.filter(items=high_corr_features, axis=0)
sns.heatmap(missingness_corr_2, cmap="bwr", center=0)
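
To make the idea behind this heatmap concrete, here is a minimal sketch on made-up data (the six records and their values are invented, not drawn from the dataset). It turns missingness in one feature into a 0/1 indicator and correlates it with one-hot status columns, which is the same kind of quantity each heatmap cell shows.

```python
import numpy as np
import pandas as pd

# Made-up records where chapters is blank exactly when the work is releasing.
toy = pd.DataFrame({
    "status": ["releasing", "finished", "releasing",
               "finished", "finished", "releasing"],
    "chapters": [np.nan, 120.0, np.nan, 45.0, 8.0, np.nan],
})

# 1.0 where chapters is missing, 0.0 otherwise.
missing_chapters = toy["chapters"].isna().astype(float)

# One indicator column per status level.
status_dummies = pd.get_dummies(toy["status"], dtype=float)

# Pearson correlation of each status indicator with the missingness indicator.
toy_corr = status_dummies.corrwith(missing_chapters)

print(toy_corr)
```

In this deliberately extreme toy case the correlation is +1 for "releasing" and -1 for "finished"; real data would be expected to fall somewhere in between.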

##### Group 1
My hypothesis about the first group can likely be rejected. Missingness in the start month and day seems to be moderately negatively correlated with start_year and end_year, likely indicating that older works are more likely than newer works to have missing data in these features. Missingness in start_year, status, and source did not clearly correlate with any features.

##### Group 2
We likely cannot reject my hypothesis about the second group, as missingness in chapters, volumes, and end dates is highly positively correlated with a status of "releasing" and highly negatively correlated with a status of "finished". This gives us valuable insight into how we might handle missing values in these features in the future.

##### Other Features
Source missingness seems to be slightly positively correlated with finished manga, and slightly negatively correlated with start_year. This might indicate that older manga (which might be more likely to be finished) are less likely to have fans on the site who have taken the time to add a source label.

Missingness in the demographic label (these are the manga whose labels we would like to predict) is positively correlated with being from Korea, the full color tag, and the boys' love tag. The country correlation makes sense to me, as I believe the demographic labels we are using for this project are largely a Japanese invention. For the full color tag, I hypothesize that being from Korea and being full color are highly correlated, based on my previous experience as an AniList user. As for the boys' love association, after consulting with a colleague, it seems that boys' love may in some ways be considered a demographic category in its own right, as there may be [boys' love specific magazines](https://en.wikipedia.org/wiki/Dear%2B) in Japan.

##### Further Research
Future investigation into the mechanism of missingness may involve additional sampling from outside AniList to find the characteristics of the missing data and whether they differ from the data we have access to. We could also perform Little's test to check whether the data is missing completely at random. Finally, if it seems likely that the data is missing at random, we could validate this by checking the characteristics of missingness within defined groups.
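
As a rough sketch of what checking missingness within defined groups might look like, the example below computes the share of missing values per feature within each status group. The frame is made up for illustration, and grouping by status is only one possible choice of group; this is not a formal test of the missingness mechanism.

```python
import numpy as np
import pandas as pd

# Made-up records; the values are invented for illustration only.
toy = pd.DataFrame({
    "status": ["finished", "finished", "releasing", "releasing", "releasing"],
    "end_year": [2010.0, np.nan, np.nan, np.nan, np.nan],
    "chapters": [50.0, 12.0, np.nan, np.nan, 30.0],
})

# Share of missing values per feature within each status group.
missing_rate_by_status = (
    toy[["end_year", "chapters"]]
    .isna()
    .groupby(toy["status"])
    .mean()
)

print(missing_rate_by_status)
```

If missingness rates differ sharply between groups but look stable within each group, that would be consistent with (though not proof of) data that is missing at random conditional on the grouping feature.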

# Correlation

Next, we will explore correlation in our data, both between pairs of features and between each feature and our label. We will also check my earlier hypothesis about the correlation between Korean manga and the full color tag.

In [None]:
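correlation = dummy_df.corr()
# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
# The builtin bool is used because the np.bool alias was removed in NumPy 1.24.
upper_corr = correlation.where(
    np.triu(np.ones(correlation.shape), k=1).astype(bool))
unstacked_corr = upper_corr.unstack().dropna()
# Copy before relabeling so that unstacked_corr keeps its original MultiIndex
unstacked_corr_2 = unstacked_corr.copy()
unstacked_corr_2.index = unstacked_corr.index.map(' & '.join)
# Sort pairs by absolute correlation, strongest first
sorted_unstacked_corr = unstacked_corr_2.reindex(unstacked_corr_2.abs().sort_values(ascending=False).index)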

def highest_correlation(data, threshold, regex=".*"):
    """
    Filters the series of pairwise correlations passed to this function,
    returning the pairs whose label matches the pattern passed in regex
    and whose absolute correlation exceeds the desired threshold.
    """

    return data.filter(regex=regex).loc[lambda x: x.abs() > threshold]

highest_correlation(data=sorted_unstacked_corr, threshold=0.75)
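
The upper-triangle step above can be hard to picture, so here is a minimal sketch on a made-up 3x3 correlation matrix (the feature names `a`, `b`, `c` and the values are invented). It shows how masking everything on and below the diagonal leaves exactly one entry per pair of features.

```python
import numpy as np
import pandas as pd

# Made-up symmetric correlation matrix for three hypothetical features.
toy_corr = pd.DataFrame(
    [[1.0, 0.8, -0.3],
     [0.8, 1.0, 0.1],
     [-0.3, 0.1, 1.0]],
    index=["a", "b", "c"],
    columns=["a", "b", "c"],
)

# True strictly above the diagonal; k=1 excludes the diagonal itself.
upper_mask = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)

# Masked-out cells become NaN and are dropped, leaving one value per pair.
toy_pairs = toy_corr.where(upper_mask).unstack().dropna()

# Flatten the (column, row) MultiIndex into a single readable label.
toy_pairs.index = toy_pairs.index.map(" & ".join)

print(toy_pairs)
```

Three features give three unique pairs, and self-correlations and mirrored duplicates no longer appear.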

In [None]:
highest_correlation(data=sorted_unstacked_corr, threshold=0.15, regex="demo_label")

In [None]:
highest_correlation(data=sorted_unstacked_corr, threshold=0.05, regex="Josei")

In [None]:
highest_correlation(data=sorted_unstacked_corr, threshold=0.15, regex="Full Color")

The list of the most highly correlated pairs reveals some patterns that may be important for feature reduction and feature extraction later on:
 - There are a number of features with perfect correlation, likely due to the gender field on characters in one particular manga or franchise containing several unique values. 
 - Features that scale along with the popularity of the manga (such as the score bucket features, user status features, and favorites) tend to be highly correlated with each other, as you might expect. This means that these features provide little new information except in context with one another.
 - End year and start year are highly correlated.
 - The anime relation and the adaptation relation are highly correlated, likely indicating that if a manga is adapted into something else that shows up on the site, it will almost always be an anime.
 - Male background roles and female background roles both scale with total background roles. This likely happens to a lesser extent with supporting and main roles as well. These features mainly provide additional information in context with one another, as with the popularity-scaling features.
 - Categorical indicators tend to be negatively correlated with one another, as we would expect.
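
The last point in the list above follows from how one-hot encoding works: a record can belong to only one level of a categorical feature, so whenever one indicator is 1 the others must be 0. A small sketch with a made-up country column (the codes and counts are invented) illustrates this.

```python
import pandas as pd

# Made-up country codes for six hypothetical records.
toy_country = pd.Series(["JP", "JP", "KR", "CN", "JP", "KR"], name="country")

# One indicator column per country; each row has exactly one 1.
country_dummies = pd.get_dummies(toy_country, dtype=float)

# Pairwise Pearson correlations between the indicator columns.
dummy_corr = country_dummies.corr()

print(dummy_corr)
```

Every off-diagonal entry is negative, which is the pattern we would expect among indicators derived from the same categorical feature.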

Some interesting observations regarding correlations with our labels:
 - Shoujo and the romance genre, heterosexual tag, and school tag are positively correlated.
 - Shoujo and the action genre and the start year are negatively correlated.
 - Shounen and number of volumes, the action genre, having an anime adaptation, and the size of supporting and main casts are positively correlated.
 - Shounen and the boys' love tag are negatively correlated.
 - Seinen, the nudity tag, and the ecchi genre are positively correlated.
 - Seinen and the romance genre are negatively correlated.
 - Josei and the primarily adult cast tag, office tag, office lady tag, slice of life genre, female protagonist tag, and work tag are positively correlated.
 - Josei and the action genre are negatively correlated.

Finally, there is fairly strong evidence for my hypothesis that the full color tag is correlated with Korean manga, with a correlation of 0.63 in our sample. Chinese manga also has a 0.32 correlation with the tag.