## Analysis of the downloaded author and recommended-author data

Files used:

1. author_data - one row per author, with the name, author_url, img_url and about text.
2. author_mapping - which authors appear on which author's page.
    * column 1 - the author whose page is scraped.
    * column 2 - an author that appeared on the page of the author from column 1 (one row per appearance).
    
Column 1, after dropping duplicates, gives one gender ratio (each scraped author counted once).
Column 2, with duplicates kept, gives another (each recommendation counted). As README.md
suggests, these two ratios should not match, unless:
1. The Nielson data doesn't generalise, OR
2. Amazon recommendations balance the gender ratio.
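
The two ratios described above can be sketched on a toy mapping table. The author names and gender labels below are invented for illustration (1 = woman, 0 = man); only the column names follow the real author_mapping file.

```python
import pandas as pd

# Toy mapping table: invented authors, same column names as author_mapping.csv.
toy_map = pd.DataFrame({
    "author":        ["a", "a", "b", "b", "c"],
    "mapped_author": ["b", "c", "a", "c", "a"],
})
# Hypothetical gender labels (1 = woman, 0 = man).
toy_gender = {"a": 1, "b": 0, "c": 0}

# Ratio 1: every scraped author counted once (column 1, duplicates dropped).
scraped = toy_map["author"].drop_duplicates().map(toy_gender)
scraped_ratio = scraped.value_counts(normalize=True)

# Ratio 2: every recommendation counted (column 2, duplicates kept).
recommended = toy_map["mapped_author"].map(toy_gender)
recommended_ratio = recommended.value_counts(normalize=True)

print(scraped_ratio)
print(recommended_ratio)
```

In this toy data one of three scraped authors is a woman, but she receives two of the five recommendations, so the two ratios differ.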

In [1]:
from collections import Counter
import pandas as pd
import sys

sys.path.append('/home/sanhita/github/author_gender_ratio/')
from utils import extract_first_name, select_higher_freq_gender

csv_path = "../files/"

%reload_ext autoreload
%autoreload 2

### Author Data 
Columns: author name, author_url, img_url, about

In [2]:
df_data = pd.read_csv(f'{csv_path}author_data.csv')
print(df_data.shape)
df_data.head()

(30980, 4)


Unnamed: 0,author,author_url,img_url,about
0,Chimamanda-Ngozi-Adichie,Chimamanda-Ngozi-Adichie/e/B00PODW5UG,https://m.media-amazon.com/images/S/amzn-autho...,CHIMAMANDA NGOZI ADICHIE's work has appeared i...
1,Bethel Joshua,/Bethel-Joshua/e/B0B9FP3WCG,https://m.media-amazon.com/images/S/amzn-autho...,
2,Jhumpa Lahiri,/Jhumpa-Lahiri/e/B001H6GTG0,https://m.media-amazon.com/images/I/61-3-dPgGf...,Jhumpa Lahiri was born in London and raised in...
3,Barbara A. Mowat,/Barbara-A-Mowat/e/B001IOH98I,https://m.media-amazon.com/images/I/51z8ocAA75...,Barbara A. Mowat (1934-2017) was the Director ...
4,Marjane Satrapi,/Marjane-Satrapi/e/B001IGFN9I,https://m.media-amazon.com/images/I/51z6DF+kRp...,"Marjane Satrapi was born in 1969 in Rasht, Ira..."


## Author mapping
Each row pairs the scraped author (first column) with one author who appeared on their page (second column).

In [3]:
df_map = pd.read_csv(f'{csv_path}author_mapping.csv')
df_map[df_map.columns[0]].nunique(), len(df_map)

(10996, 203161)

In [4]:
len(df_map) / df_map[df_map.columns[0]].nunique()

18.47590032739178

In [5]:
df_map.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203161 entries, 0 to 203160
Data columns (total 2 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   author         203161 non-null  object
 1   mapped_author  203161 non-null  object
dtypes: object(2)
memory usage: 3.1+ MB


In [6]:
# The number of authors for whom author mapping is scraped
df_map['author'].nunique()

10996

In [7]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30980 entries, 0 to 30979
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   author      30980 non-null  object
 1   author_url  29909 non-null  object
 2   img_url     30980 non-null  object
 3   about       10996 non-null  object
dtypes: object(4)
memory usage: 968.2+ KB


## Extract the author gender from their about info.

In [8]:
(df_data['about'] == 'None').sum()

537

In [9]:
df_data[df_data['about'] == 'None']

Unnamed: 0,author,author_url,img_url,about
21,Art Spiegelman,/Art-Spiegelman/e/B000APXXEK,https://m.media-amazon.com/images/I/31BQEqvZpC...,
60,MARCO PEDULLA,/MARCO-PEDULLA/e/B01MQ0WRNZ,https://m.media-amazon.com/images/S/amzn-autho...,
93,Henry Louis Gates Jr.,/Henry-Louis-Gates-Jr/e/B000APNYEO,https://m.media-amazon.com/images/I/31d0yI9pxZ...,
110,Ayad Akhtar,/Ayad-Akhtar/e/B0052XYGXM,https://m.media-amazon.com/images/S/amzn-autho...,
141,Gerry Duggan,/Gerry-Duggan/e/B007XVQ05S,https://m.media-amazon.com/images/S/amzn-autho...,
...,...,...,...,...
21303,Norman Goldwasser PhD,/Norman-Goldwasser-PhD/e/B094DTS5XZ,https://m.media-amazon.com/images/S/amzn-autho...,
21319,Ilhan Omar,/Ilhan-Omar/e/B085XLN9LS,https://m.media-amazon.com/images/I/61eHl4zJJz...,
21328,Raffi Garnighian,/Raffi-Garnighian/e/B07QGRR5WS,https://m.media-amazon.com/images/S/amzn-autho...,
21392,Theodore Dimon,/Theodore-Dimon/e/B001JRZPTY,https://m.media-amazon.com/images/I/71UlD6xFcS...,


In [10]:
about_notna_filt = df_data['about'].notna()
df_data.loc[about_notna_filt, 'gender'] = df_data.loc[about_notna_filt, 'about'].apply(select_higher_freq_gender)
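
The body of `select_higher_freq_gender` lives in utils and is not shown in this notebook. As a rough guess at how such a helper might work, the sketch below counts gendered pronouns in the about text and returns the more frequent gender. The function name `guess_gender_from_about`, the pronoun lists, and the tie handling are all assumptions, not the project's actual code.

```python
import re

# Assumed pronoun lists; the real helper may use different words.
FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}
MALE_PRONOUNS = {"he", "him", "his", "himself"}


def guess_gender_from_about(about):
    """Return 1.0 (woman), 0.0 (man), or None when the pronoun counts tie."""
    # Split into lowercase alphabetic words so "the" never matches "he".
    words = re.findall(r"[a-z]+", about.lower())
    female = sum(word in FEMALE_PRONOUNS for word in words)
    male = sum(word in MALE_PRONOUNS for word in words)
    if female == male:
        return None  # no pronouns at all, or an exact tie
    return 1.0 if female > male else 0.0


print(guess_gender_from_about("She was born in London. Her novels are widely read."))
print(guess_gender_from_about("An award-winning writer of the north."))
```

Returning None for ties would leave those rows unlabeled, which is consistent with the `gender` column containing missing values in the cells that follow.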

### Gender ratio for the scraped authors (excludes double counting)

Gender is coded as 1.0 for women and 0.0 for men.

In [11]:
df_data['gender'].value_counts(normalize=True, dropna=True)

1.0    0.518156
0.0    0.481844
Name: gender, dtype: float64

### Establishing the genders of the recommended authors

In [12]:
df_data.loc[:, 'first_name'] = df_data['author'].apply(extract_first_name)
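
`extract_first_name` is also imported from utils and not shown here. Judging only from the outputs in this notebook (for example `Chimamanda-Ngozi-Adichie` becoming `chimamanda`), a helper of this kind might look like the sketch below. The name `first_name_sketch` and the rule that bare initials yield no first name are assumptions.

```python
import re


def first_name_sketch(author):
    """Return a lowercase first name, or None when no usable one is found."""
    # Names appear both space-separated and hyphen-separated in the data.
    tokens = [token for token in re.split(r"[\s\-]+", author.strip()) if token]
    if not tokens:
        return None
    first = tokens[0].rstrip(".").lower()
    # Assumption: a bare initial such as "J." carries no gender signal.
    return first if len(first) > 1 else None


print(first_name_sketch("Chimamanda-Ngozi-Adichie"))
print(first_name_sketch("Barbara A. Mowat"))
print(first_name_sketch("J. K. Rowling"))
```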

In [13]:
gender_notna_filt = df_data['gender'].notna()

women = df_data[(df_data['first_name'].notna()) &
                (df_data['gender'] == 1)
               ]['first_name'].values

men = df_data[(df_data['first_name'].notna()) &
              (df_data['gender'] == 0)
             ]['first_name'].values


len(set(women).intersection(set(men))), len(set(women)), len(set(men))

(82, 1457, 940)

In [14]:
# The names for which both genders were identified
confusion_set = set(women).intersection(set(men))

In [15]:
df_data[df_data['first_name'].isin(confusion_set)][['first_name', 'gender']]

Unnamed: 0,first_name,gender
3,barbara,1.0
7,james,0.0
24,george,0.0
42,barbara,1.0
63,peter,
...,...,...
30929,david,
30930,charlie,
30936,peter,
30969,michael,


In [16]:
gender_dict = {}
for name in confusion_set:
    value_counts = df_data[df_data['first_name'] == name]['gender'].value_counts()
    gender_dict[name] = value_counts.idxmax()
        
gender_dict

{'lee': 0.0,
 'amy': 1.0,
 'carla': 1.0,
 'michael': 0.0,
 'amal': 1.0,
 'pat': 1.0,
 'kami': 1.0,
 'jaime': 1.0,
 'swami': 0.0,
 'britt': 0.0,
 'robin': 1.0,
 'codi': 1.0,
 'nikki': 0.0,
 'diana': 1.0,
 'ali': 1.0,
 'sheryl': 1.0,
 'chris': 0.0,
 'dale': 0.0,
 'craig': 0.0,
 'morgan': 1.0,
 'taylor': 1.0,
 'kit': 1.0,
 'joel': 0.0,
 'james': 0.0,
 'devin': 1.0,
 'barbara': 1.0,
 'janice': 1.0,
 'colin': 0.0,
 'riley': 1.0,
 'loren': 0.0,
 'lynn': 1.0,
 'shawn': 0.0,
 'rory': 0.0,
 'harriet': 1.0,
 'david': 0.0,
 'terry': 0.0,
 'shea': 1.0,
 'jean': 1.0,
 'alex': 0.0,
 'theo': 0.0,
 'bev': 0.0,
 'kris': 1.0,
 'caroline': 1.0,
 'jesse': 0.0,
 'jordan': 0.0,
 'kimberly': 1.0,
 'erika': 1.0,
 'bryce': 1.0,
 'cole': 0.0,
 'parker': 0.0,
 'tomi': 1.0,
 'noelle': 1.0,
 'jan': 1.0,
 'gregg': 1.0,
 'peter': 0.0,
 'skye': 0.0,
 'sam': 0.0,
 'mel': 0.0,
 'carey': 0.0,
 'bryan': 0.0,
 'tracy': 1.0,
 'toby': 1.0,
 'jess': 1.0,
 'jeffrey': 0.0,
 'kim': 1.0,
 'charlie': 0.0,
 'rhys': 1.0,
 'tessa': 
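
The per-name majority vote used for the ambiguous names can also be expressed as a single groupby. The sketch below uses invented rows to show the idea; unlabeled rows (missing gender) are dropped first, which matches how `value_counts` ignores them by default.

```python
import pandas as pd

# Invented rows: "robin" is labeled 1.0 twice and 0.0 once, "alex" 0.0 twice.
toy = pd.DataFrame({
    "first_name": ["robin", "robin", "robin", "alex", "alex", "alex"],
    "gender":     [1.0, 1.0, 0.0, 0.0, 0.0, None],
})

# For each first name, keep the gender label that occurs most often.
majority = (
    toy.dropna(subset=["gender"])
       .groupby("first_name")["gender"]
       .agg(lambda labels: labels.value_counts().idxmax())
       .to_dict()
)
print(majority)
```

A name with an equal number of labels for each gender would still receive one of them, so tied names may deserve a manual check.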

### One special case for names

In [17]:
for url in df_data[df_data['first_name'] == 'lane']['author_url']:
    print(url)

/Lane-Pederson/e/B0093OSJY8
/Lane-Smith/e/B000APX5G6
/Lane-R-Warenski/e/B01FWI6J2C
/Lane-Rebelo/e/B07CCKR9SP
/Lane-Walker/e/B09KYJ6WVM
/Lane-Hart/e/B00J22NZTA
/Lane-Wilcken/e/B005ONRYLG


In [18]:
for url in df_data[df_data['first_name'] == 'lane']['img_url']:
    print(url)

https://m.media-amazon.com/images/I/61VC7W7KSjL.jpg
https://m.media-amazon.com/images/S/amzn-author-media-prod/qqmn9cq193ot5t7a1ivq572dcs.jpg
https://m.media-amazon.com/images/I/81JriE1pZpL.jpg
https://m.media-amazon.com/images/S/amzn-author-media-prod/4s2irr7p01kpdlgjdn79nmh0mi.jpg
https://m.media-amazon.com/images/S/amzn-author-media-prod/8ccuvu3kf1nrhb42bsmir5vo3i.jpg
https://m.media-amazon.com/images/S/amzn-author-media-prod/2bjoqg4u4nm2snut2r05g75rq6.jpg
https://m.media-amazon.com/images/I/71Vm4ERaYJL.jpg


In [19]:
df_data[
    (~df_data['first_name'].isin(confusion_set)) &
    (df_data['gender'].notna())
][['first_name', 'gender']]

Unnamed: 0,first_name,gender
0,chimamanda,1.0
2,jhumpa,1.0
4,marjane,1.0
6,yaa,1.0
8,louise,1.0
...,...,...
21440,sarah,1.0
21444,meg,1.0
21445,thomas,0.0
21448,sidonie,1.0


In [20]:
# Add the unambiguous first names (outside confusion_set) that have a labeled gender
for _, val in df_data[
    (~df_data['first_name'].isin(confusion_set)) &
    (df_data['gender'].notna())
][['first_name', 'gender']].iterrows():
    gender_dict[val.first_name] = val.gender

In [21]:
# Count the first names labeled as women (1) and men (0)
women = 0
men = 0
for key, val in gender_dict.items():
    if val == 1:
        women += 1
    elif val == 0:
        men += 1
print(women, men)

1418 898


In [22]:
len(gender_dict) - women

898

In [23]:
df_data['gender'].value_counts(normalize=True)

1.0    0.518156
0.0    0.481844
Name: gender, dtype: float64

In [24]:
df_data['gender'].value_counts()

1.0    4024
0.0    3742
Name: gender, dtype: int64

In [25]:
df_map.loc[:, 'mapped_first_name'] = df_map['mapped_author'].apply(extract_first_name)

In [26]:
df_map.loc[:, 'author_first_name'] = df_map['author'].apply(extract_first_name)

In [27]:
df_map.head()

Unnamed: 0,author,mapped_author,mapped_first_name,author_first_name
0,Chimamanda-Ngozi-Adichie,Bethel Joshua,bethel,chimamanda
1,Chimamanda-Ngozi-Adichie,Jhumpa Lahiri,jhumpa,chimamanda
2,Chimamanda-Ngozi-Adichie,Barbara A. Mowat,barbara,chimamanda
3,Chimamanda-Ngozi-Adichie,Marjane Satrapi,marjane,chimamanda
4,Chimamanda-Ngozi-Adichie,Khaled Hosseini,khaled,chimamanda


In [28]:
df_map['author_gender'] = df_map['author_first_name'].map(gender_dict)
df_map['mapped_author_gender'] = df_map['mapped_first_name'].map(gender_dict)
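
Mapping first names through a dictionary leaves any name that is missing from the dictionary as NaN, which is why the gender columns end up with fewer non-null values than the name columns. A small sketch with an invented dictionary and names:

```python
import pandas as pd

# Hypothetical lookup: only two first names have a known gender.
toy_gender = {"jhumpa": 1.0, "khaled": 0.0}
names = pd.Series(["jhumpa", "khaled", "homer", None])

# Names absent from the dictionary (and missing names) map to NaN.
mapped = names.map(toy_gender)

# Share of rows that received a gender label.
coverage = mapped.notna().mean()
print(mapped)
print(coverage)
```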

In [29]:
df_map.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203161 entries, 0 to 203160
Data columns (total 6 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   author                203161 non-null  object 
 1   mapped_author         203161 non-null  object 
 2   mapped_first_name     197957 non-null  object 
 3   author_first_name     198037 non-null  object 
 4   author_gender         188635 non-null  float64
 5   mapped_author_gender  176244 non-null  float64
dtypes: float64(2), object(4)
memory usage: 9.3+ MB


In [30]:
df_map[df_map['mapped_author_gender'].isna()]['author_gender'].value_counts(normalize=True)

1.0    0.535756
0.0    0.464244
Name: author_gender, dtype: float64

## Where authors' gender is labeled

In [31]:
df_map[df_map['author_gender'].notna()]['author_gender'].value_counts(normalize=True)

1.0    0.526679
0.0    0.473321
Name: author_gender, dtype: float64

In [32]:
df_map[df_map['mapped_author_gender'].isna()]['mapped_author'].nunique()

5206

In [33]:
df_map_filt = df_map[df_map['author_gender'].notna()]

In [34]:
df_map_filt[df_map_filt['mapped_author_gender'].isna()]['mapped_author']

0                Bethel Joshua
4              Khaled Hosseini
11                       Homer
12                Mohsin Hamid
15             Khaled Hosseini
                  ...         
203084     Sheree Renée Thomas
203089    Jewell Parker Rhodes
203112            André Aciman
203133              SparkNotes
203148      Kivi Leroux Miller
Name: mapped_author, Length: 23339, dtype: object

In [35]:
df_map_filt.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 188635 entries, 0 to 203160
Data columns (total 6 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   author                188635 non-null  object 
 1   mapped_author         188635 non-null  object 
 2   mapped_first_name     183806 non-null  object 
 3   author_first_name     183511 non-null  object 
 4   author_gender         188635 non-null  float64
 5   mapped_author_gender  165296 non-null  float64
dtypes: float64(2), object(4)
memory usage: 10.1+ MB


In [36]:
for gender in [1, 0]: 
    print(gender, df_map_filt[df_map_filt['author_gender'] == gender]['author'].nunique())

1 5077
0 5157


In [37]:
for gender in [1, 0]: 
    print(gender, (df_map_filt['author_gender'] == gender).sum() / df_map_filt[df_map_filt['author_gender'] == gender]['author'].nunique())

1 19.56864289935001
0 17.31336048089975


On average, a woman author has 19.6 mapped authors and a man author has 17.3.
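
The same per-gender average can be computed with a groupby instead of a loop. The sketch below uses invented rows; only the column names match `df_map_filt`.

```python
import pandas as pd

# Invented rows: author "a" (woman) has 3 mappings, "b" and "c" (men) have 2 and 1.
toy = pd.DataFrame({
    "author":        ["a", "a", "a", "b", "b", "c"],
    "author_gender": [1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
})

# Number of mapping rows per author, then the mean of those counts per gender.
rows_per_author = toy.groupby(["author_gender", "author"]).size()
avg_mappings = rows_per_author.groupby(level="author_gender").mean()
print(avg_mappings)
```

The mean of the per-author counts equals the total number of rows divided by the number of unique authors, so it agrees with the ratio computed in the loop.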

## Find the gender ratio of the authors mapped to women and to men authors.

In [41]:
# for women authors
gender = 1.0
author_gender_filt = (df_map_filt['author_gender'] == gender)

df_map_filt[author_gender_filt]['mapped_author_gender'].value_counts(normalize=True)

1.0    0.717477
0.0    0.282523
Name: mapped_author_gender, dtype: float64

In [40]:
# For men authors
gender = 0.0
author_gender_filt = (df_map_filt['author_gender'] == gender)

df_map_filt[author_gender_filt]['mapped_author_gender'].value_counts(normalize=True)

0.0    0.726042
1.0    0.273958
Name: mapped_author_gender, dtype: float64
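
Both conditional ratios can be read from one table by cross-tabulating the two gender columns and normalising each row. The sketch below uses invented rows with the same column names as `df_map_filt`.

```python
import pandas as pd

# Invented rows: 3 of 4 recommendations on women's pages are women,
# and 3 of 4 recommendations on men's pages are men.
toy = pd.DataFrame({
    "author_gender":        [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "mapped_author_gender": [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0],
})

# Each row sums to 1: the gender split of recommendations for that author gender.
ratio_table = pd.crosstab(
    toy["author_gender"],
    toy["mapped_author_gender"],
    normalize="index",
)
print(ratio_table)
```

`pd.crosstab` skips rows where either value is missing, which mirrors how `value_counts` drops unlabeled mapped authors.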