# **Gender Impact: Exploring the extent of gender imbalances in Genshin Impact's character roster**




## 1 Introduction

To what extent do female players of Genshin Impact experience a lack of content that caters to the female gaze?

Genshin Impact is a highly acclaimed open-world action role-playing game. Although it is free to play, it features paid options that feed into the luck-based mechanics used to collect visually appealing characters, earning an estimated [$2.3 to $3.7 billion](https://screenrant.com/genshin-impact-fortnite-gta5-first-year-revenue/) in revenue across all platforms in its first year of release. Since character collection is a major component of combat (and a main source of revenue for the company), this analysis focuses on the preferences of the playerbase and how they are (and are not) reflected in the characters released.

As of 2023, [45% of the playerbase self-identified as female](https://prioridata.com/data/genshin-impact-player-count/), many of whom would prefer at least some variety in the characters they can choose from. However, there has been [criticism](https://gamerant.com/genshin-impact-needs-more-male-characters/#:~:text=Genshin%20Impact%20players%20have%20expressed,total%20of%2093%20playable%20characters) from frustrated players regarding the lack of playable male characters (and the repeated use of fanservice meant to appeal to a male audience). While female characters could be designed for the [female gaze](https://sartorialmagazine.com/lifestyle/2023/2/17/the-female-gaze), this analysis focuses on the lack of male characters, since this gender imbalance has been the center of public discourse.

This analysis seeks to answer the following questions: 
   - What is the gender distribution overall for all characters, and has this changed over time?
   - For the male characters that are available, do they offer enough gameplay variety compared to female characters?
   - How accessible are male characters in the game compared to female? Are they concentrated only at higher rarities?                                                                 

## 2 Data Collection

I initially found Genshin Impact character data already collected and published on Kaggle (current as of August 2024), so the dataset used in this analysis was pulled from Kaggle via their API. Later on, I found that the same character data is publicly available through the [Genshin Impact Fandom Wiki](https://genshin-impact.fandom.com/wiki/Character/List). In the long term, I would pull directly from the Wiki to always have the latest information on characters. This character list was the only data source I used.

This tabular dataset features 83 characters, not including the default main characters and Aloy (who was part of a limited collaboration with Sony). It features both qualitative and quantitative data. However, my analysis focuses primarily on dates and qualitative data.

My original plan was completely different from this analysis. I initially planned to analyze data collected across the past year from my cats' Litter Robots. However, this data was not publicly accessible via a website or API, and there was no straightforward way for me to integrate a public dataset into an analysis of my Litter Robots given the timeline. I decided to pursue a dataset that I am familiar with and highlight a topic that I am passionate about.

Below is a sample of the raw data.

In [139]:
import datetime

import numpy as np
import pandas as pd
import plotly
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

# Render Plotly figures inline in the notebook
pio.renderers.default = "notebook"
plotly.offline.init_notebook_mode(connected=True)

In [157]:
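df_raw = pd.read_csv('../data/raw/characters_dataset.csv')

# Keep only the 8 columns needed for this analysis (explicit copy avoids chained-assignment warnings)
df = df_raw[['character_name', 'star_rarity', 'region', 'vision',
             'weapon_type', 'release_date', 'model', 'limited']].copy()

# 'model' looks like "Medium Male": body type first, gender second
df['height'] = df['model'].str.split(' ', expand=True)[0]
df['gender'] = df['model'].str.split(' ', expand=True)[1]
# 'release_date' is a YYYY-MM-DD string, so the year is the first piece
df['release_year'] = df['release_date'].str.split('-', expand=True)[0]
# Treat rarity as a category rather than a number
df['star_rarity'] = np.where(df['star_rarity']==5, '5-Star', '4-Star')
# Drop characters that cannot be obtained through the luck-based mechanic
df = df[df['character_name']!='Aloy']
df = df[~df['character_name'].str.contains('Traveler')]
df_raw.head()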

| | character_name | star_rarity | region | vision | arkhe | weapon_type | release_date | model | constellation | birthday | ... | atk_1_20 | def_1_20 | ascension_special_stat | special_0 | special_1 | special_2 | special_3 | special_4 | special_5 | special_6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albedo | 5 | Mondstadt | Geo | | Sword | 2020-12-23 | Medium Male | Princeps Cretaceus | 13-Sep | ... | 20 | 68 | Geo DMG Bonus | 0.00% | 0.00% | 7.20% | 14.40% | 14.40% | 21.60% | 28.80% |
| 1 | Alhaitham | 5 | Sumeru | Dendro | | Sword | 2023-01-18 | Tall Male | Vultur Volans | 11-Feb | ... | 24 | 60 | Dendro DMG Bonus | 0.00% | 0.00% | 7.20% | 14.40% | 14.40% | 21.60% | 28.80% |
| 2 | Aloy | 5 | | Cryo | | Bow | 2021-09-01 | Medium Female | Nora Fortis | 04-Apr | ... | 18 | 53 | Cryo DMG Bonus | 0.00% | 0.00% | 7.20% | 14.40% | 14.40% | 21.60% | 28.80% |
| 3 | Amber | 4 | Mondstadt | Pyro | | Bow | 2020-09-28 | Medium Female | Lepus | 10-Aug | ... | 19 | 50 | ATK | 0.00% | 0.00% | 6.00% | 12.00% | 12.00% | 18.00% | 24.00% |
| 4 | Arataki Itto | 5 | Inazuma | Geo | | Claymore | 2021-12-14 | Tall Male | Taurus Iracundus | 01-Jun | ... | 18 | 75 | CRIT Rate | 0.00% | 0.00% | 4.80% | 9.60% | 9.60% | 14.40% | 19.20% |


## 3 Data Cleaning

One of the first steps I took to clean the data was to decide which columns were needed for the scope of the analysis. The raw data contained 89 columns, and I kept 8.

I filtered out any characters (or variations of characters) that players are unable to obtain through the luck-based mechanic. This includes the multiple forms of the main character (male/female form and the element associated with the main character) and Aloy, who came from a collaboration with Sony in 2021 and can no longer be obtained by new players.

I also processed the text from the following columns:
- model: Contained gender and model body type. I performed a string split to grab only gender
- release_year: Extracted from full release_date string (I also grabbed release_month for a later visualization)
- star_rarity: This column was a number (either 4 or 5), so I converted it to a string ("4-Star" and "5-Star") since we are treating them as categories in this analysis

For a later visualization, I created a conditional field called "genders_released" that indicates genders of characters that were released in a given month (Female-only, Male-only, Female + Male). This is used to visualize the gaps in time between when male characters get rolled out (and compare them to the release schedule of female characters).
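
As a minimal sketch of the "genders_released" idea described above (not the exact code used later in this notebook), the per-month label can be derived from the set of genders released in that month. The function name `classify_months` and the row marked as made up are assumptions for illustration; the other rows come from the data samples shown in this notebook.

```python
from collections import defaultdict

def classify_months(releases):
    """Map (year, month) -> 'Female Only', 'Male Only', or 'Male + Female'.

    `releases` is an iterable of (release_date, gender) pairs, where
    release_date is a 'YYYY-MM-DD' string as in the dataset.
    """
    genders_by_month = defaultdict(set)
    for release_date, gender in releases:
        year, month, _day = release_date.split("-")
        genders_by_month[(year, month)].add(gender)

    labels = {}
    for key, genders in genders_by_month.items():
        if genders == {"Male", "Female"}:
            labels[key] = "Male + Female"
        elif genders == {"Male"}:
            labels[key] = "Male Only"
        elif genders == {"Female"}:
            labels[key] = "Female Only"
    return labels

sample = [
    ("2020-12-23", "Male"),    # Albedo
    ("2023-01-18", "Male"),    # Alhaitham
    ("2020-09-28", "Female"),  # Amber
    ("2024-04-24", "Female"),  # Arlecchino
    ("2020-09-15", "Male"),    # made-up row, only to produce a mixed month
]
labels = classify_months(sample)
print(labels[("2020", "09")])  # Male + Female
```

Months with no releases simply never appear in the result, which is what leaves them uncolored in the later heatmap.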

For each data visualization, I created a copy of the processed data, and I aggregated the data to get counts by groups (and their corresponding percentages).

Below is a sample of the processed data.

In [163]:
df.head()

| | character_name | star_rarity | region | vision | weapon_type | release_date | model | limited | height | gender | release_year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albedo | 5-Star | Mondstadt | Geo | Sword | 2020-12-23 | Medium Male | True | Medium | Male | 2020 |
| 1 | Alhaitham | 5-Star | Sumeru | Dendro | Sword | 2023-01-18 | Tall Male | True | Tall | Male | 2023 |
| 3 | Amber | 4-Star | Mondstadt | Pyro | Bow | 2020-09-28 | Medium Female | False | Medium | Female | 2020 |
| 4 | Arataki Itto | 5-Star | Inazuma | Geo | Claymore | 2021-12-14 | Tall Male | True | Tall | Male | 2021 |
| 5 | Arlecchino | 5-Star | Snezhnaya | Pyro | Polearm | 2024-04-24 | Tall Female | True | Tall | Female | 2024 |


## 4 Data Analysis

For any sort of analysis, I approach the dataset from a "high to low" perspective: I start with very broad exploration at a high level and drill down to certain details that look interesting later on. This allows me to "peel the layers" of the dataset and uncover insights while always having context at a high level.

### 1. Quantifying the overall gap between female and male characters
- 64% of characters are female while 36% are male
- The percentage of male characters has generally decreased year over year since the game launched. The exception is 2023, when 53% of characters released were male and 47% were female; this is the only year in which male characters made up the larger share. Although the data is only current through August 2024, only two additional male characters have been released (along with new female characters) since then.
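
The percentages above are shares computed within each group (overall, or per release year). A minimal, library-free sketch of that arithmetic follows; it mirrors the `groupby(...).transform('sum')` pattern used in the visualization cells. The helper name `share_by_group` and the sample rows are hypothetical and are not the real roster.

```python
from collections import Counter

def share_by_group(rows):
    """Return {group: {category: percent}}, with percents computed within each group."""
    counts = Counter(rows)                        # (group, category) -> count
    totals = Counter(group for group, _ in rows)  # group -> total count
    shares = {}
    for (group, category), n in counts.items():
        shares.setdefault(group, {})[category] = round(100 * n / totals[group], 1)
    return shares

# Hypothetical (release_year, gender) pairs, NOT the real character list
rows = [
    ("2020", "Female"), ("2020", "Female"), ("2020", "Male"),
    ("2021", "Female"), ("2021", "Male"), ("2021", "Male"), ("2021", "Male"),
]
shares = share_by_group(rows)
print(shares["2020"])  # {'Female': 66.7, 'Male': 33.3}
```

Dividing by the total of the group, rather than by the grand total, is what makes each year's bars add up to 100% on their own.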

### 2. Identifying key areas of improvement
- Each character possesses an element and weapon type that corresponds with their combat abilities
- Elements: There were no elements where males outnumbered females. In most elements, the share of male characters was at least higher than the overall male share (36%). Anemo was the most balanced element with 55% female and 46% male (both figures are rounded, so they do not sum to exactly 100%), while Electro was the least balanced with 77% female and 23% male. Pyro and Hydro both had 33% males.
- Weapons: There were no weapons where males outnumbered females. However, claymore-wielding characters had the most balanced gender distribution with 53% female and 47% male. Catalyst wielders were the least balanced with 71% female and 29% male.


### 3. Assessing accessibility of female and male characters
- There are two levels of rarity for characters: 4-stars and 5-stars. 5-stars have better combat stats, but they have a lower probability of being pulled (especially without players spending real-life currency).
- The overall split between 4-stars and 5-stars is nearly even, with 48% of characters being 4-stars and 52% being 5-stars.
- This balance is not reflected when drilling down by gender. Female characters are balanced with 51% 4-stars and 49% 5-stars, while male characters are 43% 4-stars and 57% 5-stars. On top of being released less frequently, male characters are overall more difficult to obtain for free-to-play players and low spenders.
- The longest gap between new male characters being released is 4 months, while the longest gap between new female characters being released is 2 months, indicating that players who want male characters typically have to wait longer between releases.
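
One way to measure the release gaps mentioned above is to convert each release date to a running month index and take the largest difference between consecutive release months. This is a sketch under that assumed definition of "gap" (a release in January followed by one in March counts as 2 months); the function name and the sample dates are hypothetical and are not taken from the dataset.

```python
def longest_gap_in_months(release_dates):
    """Largest difference, in calendar months, between consecutive release months."""
    # 'YYYY-MM-DD' -> year * 12 + month, so gaps across a year boundary are handled
    month_index = sorted({int(d[:4]) * 12 + int(d[5:7]) for d in release_dates})
    if len(month_index) < 2:
        return 0
    return max(later - earlier for earlier, later in zip(month_index, month_index[1:]))

# Hypothetical release dates, NOT the real release schedule
male_dates = ["2021-01-12", "2021-02-03", "2021-06-09", "2021-07-21"]
female_dates = ["2021-11-24", "2021-12-14", "2022-02-16"]

print(longest_gap_in_months(male_dates))    # 4
print(longest_gap_in_months(female_dates))  # 2
```

Using a set of month indexes means several characters released in the same month count as a single release month, which matches the month-level view in the heatmap.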

## 5 Visualizations

Each visualization corresponds to a key finding.

### 1. Quantifying the overall gap between female and male characters

I used bar charts for these visuals because they are an efficient way to display many categories in one view while remaining digestible.

The colors indicate the gender while the bars indicate count of characters. I included labels to also share percentage breakdown.

In [162]:
df_grouped=df.groupby('gender').count().reset_index()
df_grouped = df_grouped[['gender', 'character_name']].rename(columns={'character_name':'count'})
df_grouped['percentage'] = round(df_grouped['count']/df_grouped['count'].sum()*100,1)

fig = px.bar(df_grouped,
            x='count',
            y='gender',
            title='Female vs Male Character Counts 2020 - 2024',
            color='gender',
            hover_data=['gender', 'count', 'percentage'],
            barmode='stack', width=800, height=600, text_auto=False,
            text=['{} ({:.0%})'.format(v, p/100) for v,p in zip(df_grouped['count'], df_grouped['percentage'])],
            color_discrete_map={
                'Male': '#2280B8',
                'Female': '#F29A8B'},
            orientation='h'
            )

fig.update_traces(width = .2)
fig.update_layout( plot_bgcolor='#F1EEEE')
# plot
fig.show()

Bar charts are also an effective way to observe changes over time between categories.

In [142]:
# has the gender distribution changed over time?
df_grouped=df.groupby(['release_year', 'gender']).count().reset_index()
df_grouped = df_grouped[['release_year','gender', 'character_name']].rename(columns={'character_name':'count'})
df_grouped['percentage'] = round(df_grouped['count']/df_grouped.groupby(['release_year'])['count'].transform('sum')*100,1)
fig = px.bar(df_grouped, x='release_year', y='count',
             title='Female vs Male Character Counts over Time, 2020 - 2024',
             hover_data=['gender', 'count', 'percentage'], color='gender',
             barmode='group', width=800, height=600, text_auto=False,
             text=['{} ({:.0%})'.format(v, p/100) for v,p in zip(df_grouped['count'], df_grouped['percentage'])],
             color_discrete_map={'Male': '#2280B8','Female': '#F29A8B'}
                )
fig.update_traces(textposition='outside')
fig.update_layout( plot_bgcolor='#F1EEEE')
fig.show()

### 2. Identifying key areas of improvement

I also opted for bar charts for these visuals, using the same color mapping for gender. These visualizations feature stacked bars for conciseness.

In [143]:
# Is there a group with the most equal distribution? Are they concentrated in certain weapons? Vision types?
df_grouped = df.groupby(['vision', 'gender']).count().reset_index()
df_grouped = df_grouped[['vision', 'gender', 'character_name']].rename(columns={'character_name':'count'})
df_grouped['percentage'] = round(df_grouped['count']/df_grouped.groupby(['vision'])['count'].transform('sum')*100,1)

df_grouped
fig = px.bar(df_grouped, x="count", y="vision", color="gender", hover_data=['gender', 'count', 'percentage'],
            title = 'Comparing Gender Distribution Across Elements, 2020 - 2024',
            width=800, height=600, text_auto=False,
            text=['{} ({:.0%})'.format(v, p/100) for v,p in zip(df_grouped['count'], df_grouped['percentage'])],
            color_discrete_map={
                'Male': '#2280B8',
                'Female': '#F29A8B'},
            orientation='h',
             labels={
                     "vision": "Element"}

)

fig.update_traces(width = .5)
fig.update_layout( plot_bgcolor='#F1EEEE')
fig.update_layout(yaxis={'categoryorder': 'total ascending'})


fig.show()

In [144]:
# Is there a weapon type with the most equitable distribution?
df_grouped = df.groupby(['weapon_type', 'gender']).count().reset_index()
df_grouped = df_grouped[['weapon_type', 'gender', 'character_name']].rename(columns={'character_name':'count'})
df_grouped['percentage'] = round(df_grouped['count']/df_grouped.groupby(['weapon_type'])['count'].transform('sum')*100,1)

df_grouped
fig = px.bar(df_grouped, x="count", y="weapon_type", color="gender", hover_data=['gender', 'count', 'percentage'],
            title = 'Comparing Gender Distribution Across Weapon Types, 2020 - 2024',
            width=800, height=600, text_auto=False,
            text=['{} ({:.0%})'.format(v, p/100) for v,p in zip(df_grouped['count'], df_grouped['percentage'])],
            color_discrete_map={
                'Male': '#2280B8',
                'Female': '#F29A8B'},
            orientation='h'

)

fig.update_traces(width = .4)
fig.update_layout( plot_bgcolor='#F1EEEE')
fig.update_layout(yaxis={'categoryorder': 'total ascending'})

fig.show()

### 3. Assessing accessibility of female and male characters

I opted for pie and sunburst charts here because I wanted to emphasize the difference in proportions rather than raw counts (which we already covered in previous visualizations).

The colors correspond to the character rarity levels.

In [145]:
# what do character rarities look like? 
df_grouped=df.groupby(['star_rarity']).count().reset_index()
df_grouped = df_grouped[['star_rarity', 'character_name']].rename(columns={'character_name':'count'})
# Share of all characters (grouping by star_rarity again here would always give 100%)
df_grouped['percentage'] = round(df_grouped['count']/df_grouped['count'].sum()*100,1)

fig = px.pie(
    df_grouped, 
    values='count', 
    names='star_rarity', 
    color = 'star_rarity', 
    width=600, 
    height=400, 
    title = 'Overall Count of Characters by Star-Rarity, 2020 - 2024',
    color_discrete_map={'4-Star':'#655A7C','5-Star':'#ffd166'}
)

fig.update_traces(textposition='inside', textinfo='percent+label', insidetextorientation = 'radial', textfont_size=10)

fig.show()

This sunburst chart shows the breakdown of character rarities within each gender group. I selected this chart because it efficiently highlights the imbalance of gender AND character rarity (specifically for male characters) in one visual.

In [164]:
df_grouped = df.groupby(['star_rarity', 'gender']).count().reset_index()
df_grouped = df_grouped[['star_rarity', 'gender', 'character_name']].rename(columns={'character_name':'count'})
# Share within each gender, matching the sunburst's gender -> rarity path
df_grouped['percentage'] = round(df_grouped['count']/df_grouped.groupby(['gender'])['count'].transform('sum')*100,1)

fig = px.sunburst(
    df_grouped, 
    path=['gender', 'star_rarity'], 
    values='count', 
    color='star_rarity',
    width=600, 
    height=400, 
    title = 'Breakdown of Overall Character Rarities, 2020 - 2024',
    )
fig.update_traces(
    textinfo="label+percent parent",
    insidetextorientation='horizontal',
    textfont_size=10
)
color_mapping = {'Male': '#2280B8','Female': '#F29A8B', '4-Star':'#655A7C','5-Star':'#ffd166'}
fig.update_traces(marker_colors=[color_mapping[cat] for cat in fig.data[-1].labels])

fig.show()

To illustrate gaps between new characters, I decided to create a heatmap. The x-axis contains release month (using month number) and the y-axis contains release year.

Each rectangle has a color that indicates the genders released that month: 1 = Female Only, 2 = Male Only, 3 = Female and Male. Boxes with no color (i.e., grey) mean that no new characters were released.

I decided to use this visual because I wanted to highlight how frequently new female characters get released vs males.

I struggled with creating a custom mapping for discrete values. Given more time, I would have improved the color scale to show discrete values rather than continuous.
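
One possible approach to the discrete mapping, sketched here rather than applied to the chart below: Plotly colorscales are lists of `[position, color]` pairs, so repeating each color at the start and end of its band produces flat blocks instead of a gradient. The helper name `discrete_colorscale` is hypothetical, and the third color is an arbitrary choice reused from the rarity palette.

```python
def discrete_colorscale(colors):
    """Build a Plotly-style colorscale with one flat band per color."""
    n = len(colors)
    scale = []
    for i, color in enumerate(colors):
        scale.append([i / n, color])        # start of this color's band
        scale.append([(i + 1) / n, color])  # end of the band, same color, so no gradient
    return scale

# Codes used in this notebook: 1 = Female Only, 2 = Male Only, 3 = Female + Male
code_colors = ["#F29A8B", "#2280B8", "#655A7C"]
colorscale = discrete_colorscale(code_colors)
print(len(colorscale))  # 6
```

The resulting list could then be passed as `colorscale=colorscale` to `go.Heatmap`, together with `zmin=0.5` and `zmax=3.5` so that the codes 1, 2, and 3 each fall in the middle of their own band. I have not tested this against the figure in this notebook.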

In [168]:

df_cal = df.copy()
df_cal['release_date'] = pd.to_datetime(df_cal['release_date'])
df_cal['release_month'] = df_cal['release_date'].dt.strftime('%m')

# check if month has both male and female characters released
df_cal['genders_released'] = df_cal.groupby(['release_year', 'release_month'])['gender'].transform(lambda x: 'Male + Female' if set(['Male', 'Female']).issubset(x) else np.nan)

df_cal['genders_released1'] = df_cal.groupby(['release_year', 'release_month'])['gender'].transform(lambda x: 'Male Only' if x.nunique() == 1 and x.iloc[0] == 'Male' else np.nan)
df_cal['genders_released2'] = df_cal.groupby(['release_year', 'release_month'])['gender'].transform(lambda x: 'Female Only' if x.nunique() == 1 and x.iloc[0] == 'Female' else np.nan)

df_cal['genders_released'] = df_cal['genders_released'].fillna(df_cal['genders_released1'])
df_cal['genders_released'] = df_cal['genders_released'].fillna(df_cal['genders_released2'])
df_cal.drop(['genders_released1', 'genders_released2'], axis=1, inplace=True)
df_cal_grouped = df_cal.groupby(['release_year','release_month', 'genders_released']).count().reset_index()
df_cal_grouped = df_cal_grouped[['release_year','release_month', 'genders_released','character_name']].rename(columns={'character_name':'count'})

df_cal_grouped['gender_released_code'] = np.where(df_cal_grouped['genders_released']=='Male + Female', 3, df_cal_grouped['genders_released'] )
df_cal_grouped['gender_released_code'] = np.where(df_cal_grouped['gender_released_code'] =='Male Only', 2,  df_cal_grouped['gender_released_code'] )
df_cal_grouped['gender_released_code'] = np.where(df_cal_grouped['genders_released']=='Female Only', 1, df_cal_grouped['gender_released_code'] )


# Axis values and color codes for the heatmap
release_year = list(df_cal_grouped['release_year'])
release_month = list(df_cal_grouped['release_month'])
z = list(df_cal_grouped['gender_released_code'])

fig = go.Figure(data=go.Heatmap(
        z=z,
        x=release_month,
        y=release_year,
        colorscale='ylgnbu', hoverinfo=['y', 'x', ]))


fig.update_layout(
    title=dict(text='Genders Released per Month (Color Code: 1 = F Only, 2 = M Only, 3 = F + M)'),
    xaxis_nticks=36)

fig.update_layout(hovermode="x")

fig.update_layout(xaxis={'categoryorder': 'category ascending'})
fig.update_layout(yaxis={'categoryorder': 'category descending'})

fig.show()


## 6 Conclusion

The data confirms that there truly is an imbalance in how characters are released.

With 45% of the playerbase being female but only 36% of characters being male, a section of the playerbase is likely to feel unseen. This gap has also widened over time. There are specific areas of gameplay that could benefit from more diversity: the Electro, Pyro, and Hydro elements could include more male characters, and the same can be said for catalyst wielders. The male characters released so far have also been less accessible to free-to-play players and low spenders, which discourages the large portion of the playerbase that does not have disposable income to spend on obtaining these characters.

Overall, given the downward trend in male character releases, some players may decide to leave the game for others that cater more to their preferences.

## 7 Future Work
Given more time, I would incorporate additional data sources. I am particularly interested in character revenue data. One potential argument against releasing male characters is that they might not sell as well as female characters. However, this may not necessarily be the case, and I would want to look at actual sales data to validate it.

Additionally, I would want to gather data on the spending level of the playerbase and how this differs between genders. Considering that male characters are more difficult to obtain, this could alienate free-to-low spenders.

Finally, I am interested in gathering social media data (Twitter posts, Reddit posts, etc.) talking about gender imbalances in Genshin Impact. This would help highlight how the community is affected by the developer's decisions.