# Hypotheses

We have:

1) Groups gain a privilege from connecting with super-cited (SC) researchers
2) This privilege is biased in favor of men

We have three hypotheses:
 1) Women leave the career

 2) There is a difference in local connection to SC between men and women

 3) There is another gender bias present that is unrelated to network structure

# Hypothesis 2

There is a difference in local connectivity among those who are connected with SC:

1. Those who are successful 5 years later have more connections
2. Those who are successful 5 years later have more collaborations

In [6]:
import pandas as pd
import networkx as nx
import numpy as np

### Citation table

We keep citations from 1990 to 2019 and drop self-citations.

In [7]:
cite = pd.read_csv("../data/processed/cites.csv")

cite = cite[(cite.t_year >= 1990) & (cite.t_year < 2020)]
cite = cite[(cite.s_year >= 1990) & (cite.s_year < 2020)]

cite = cite[cite.target != cite.source]

In [8]:
papers = pd.read_csv("../data/processed/adjacency_papers.csv")

papers = papers[(papers.t_year >= 1990) & (papers.t_year < 2020)]
papers = papers[(papers.s_year >= 1990) & (papers.s_year < 2020)]

papers = papers[papers.target != papers.source]

### Authors table

There are two author tables:

1. One with the comparable groups A and B (`people`), and
2. One with all the authors found in the RePEc repository (`all_people`).

In [9]:
people = pd.read_csv("../data/processed/network_people.csv")
all_people = pd.read_csv("../data/processed/people.csv")

### Places table

We use each author's institution to infer their place of work. For each institution we have:

1. The region (continent)
2. The sub-region (sub-continent)
3. Country 3-letter code
4. The institution's name

In [10]:
places = pd.read_csv("../data/processed/institution.csv")

places = places[['Handle', 'Primary-Name', 'alpha-3', 'region', 'sub-region']].set_index("Handle")

### Adding place of work to people

In [11]:
all_people = pd.merge(all_people,
                  places,
                  left_on="Workplace-Institution",
                  right_index=True,
                  how="left")

# all_people = all_people[all_people.region.notna()]

### Adding gender to the citation table

We attach two genders to each citation:

1. Gender of the source (`gender_s`)
2. Gender of the target (`gender`)

In [12]:
cite = pd.merge(cite,
                all_people[["Short-Id", "gender"]],
                how="left",
                left_on="target",
                right_on="Short-Id")

cite = pd.merge(cite,
                all_people[["Short-Id", "gender"]].rename(columns={"gender":"gender_s"}),
                how="left",
                left_on="source",
                right_on="Short-Id")

Let's remove the citations where the gender of either the target or the source is missing.

In [13]:
cite = cite[cite.gender.notna()]
cite = cite[cite.gender_s.notna()]

## Super-cited researchers

Let's get some basic statistics of the super-cited researchers in our citation network.

In [14]:
G_cite = nx.from_pandas_edgelist(cite,
                            source='source',
                            target='target',
                            create_using=nx.DiGraph)

In [15]:
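# In-degree of an author = number of distinct authors citing them
degree = pd.DataFrame(G_cite.in_degree(), columns=["author", "degree"])
mu = degree.degree.mean()
# Interquartile range of the in-degree distribution
r = degree.degree.quantile(.75) - degree.degree.quantile(.25)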

In [16]:
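# Super-cited: in-degree at least the mean plus 1.5 times the interquartile range
super_cited = degree[degree.degree >= mu + 1.5 * r].author.unique()
# Citations whose target is a super-cited author
cite_sc = cite[cite.target.isin(super_cited)]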
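To make the cutoff concrete, here is a minimal sketch on a toy in-degree table. The authors and counts are invented for illustration, and `super_cited_threshold` is a hypothetical helper name, not part of the notebook. Note that the cutoff is centered on the mean rather than on the third quartile, so it is not the same as the usual Tukey fence.

```python
import pandas as pd

# Toy in-degree table (hypothetical authors and counts, for illustration only).
toy_degree = pd.DataFrame({
    "author": ["a", "b", "c", "d", "e", "f", "g", "h"],
    "degree": [0, 1, 1, 2, 2, 3, 4, 40],
})

def super_cited_threshold(degree_series):
    """Mean in-degree plus 1.5 times the interquartile range."""
    mu = degree_series.mean()
    iqr = degree_series.quantile(0.75) - degree_series.quantile(0.25)
    return mu + 1.5 * iqr

threshold = super_cited_threshold(toy_degree["degree"])
toy_super_cited = toy_degree.loc[toy_degree["degree"] >= threshold, "author"].tolist()
print(threshold, toy_super_cited)
```

In this toy table only the author with 40 citing authors clears the cutoff.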

## Collaboration network

In [100]:
col = pd.read_csv("../data/processed/co_author.csv")

col = col[(col.year >= 1990) & (col.year < 2020)]

# Keep one row per ordered (author1, author2) pair, regardless of year
col = col.drop_duplicates(subset=['author1', 'author2'])

## Add gender to collaboration network

In [101]:
col = pd.merge(col,
               all_people[['Short-Id', 'gender']],
               left_on='author1',
               right_on='Short-Id',
               how='left')
col = pd.merge(col,
               all_people[['Short-Id', 'gender']],
               left_on='author2',
               right_on='Short-Id',
               suffixes=["_1", "_2"],
               how='left')

In [102]:
col = col.dropna(subset=['gender_1', 'gender_2'])

# Super-cited by year

In [23]:
years = []
super_cited = []  # one set of super-cited authors per window (replaces the global array)
# Windows: 1990-2000, 2001-2003, 2004-2005, then one per year from 2006 to 2019
for year in [2000, 2003] + list(range(2005, 2020)):
    years.append(year)
    if year == 2000:
        chunk = cite[cite.s_year <= year]
    elif year == 2003:
        chunk = cite[(cite.s_year > 2000) & (cite.s_year <= 2003)]
    elif year == 2005:
        chunk = cite[(cite.s_year > 2003) & (cite.s_year <= 2005)]
    else:
        chunk = cite[cite.s_year == year]
    # Citation graph restricted to citations made within this window
    G_year = nx.from_pandas_edgelist(chunk,
                                     source="source",
                                     target="target",
                                     create_using=nx.DiGraph)
    degree = pd.DataFrame(G_year.in_degree(), columns=["author", "degree"])
    mu = degree.degree.mean()
    r = degree.degree.quantile(.75) - degree.degree.quantile(.25)
    # Same cutoff as before: mean in-degree plus 1.5 times the interquartile range
    scited = degree[degree.degree >= mu + 1.5 * r].author.unique()
    super_cited.append(set(scited))
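The window selection in the loop above could be factored into a small helper so that the same rule is reused for citations and collaborations. This is only a sketch: `year_window` and `POOLED_WINDOWS` are hypothetical names, and the toy table is invented for illustration.

```python
import pandas as pd

# Exclusive lower bounds of the pooled windows, keyed by the window's end year.
# None means "no lower bound". Years not listed here are single-year windows.
POOLED_WINDOWS = {2000: None, 2003: 2000, 2005: 2003}

def year_window(df, year, column="year"):
    """Return the rows of `df` that fall in the window ending at `year`."""
    if year not in POOLED_WINDOWS:
        return df[df[column] == year]
    mask = df[column] <= year
    lower = POOLED_WINDOWS[year]
    if lower is not None:
        mask = mask & (df[column] > lower)
    return df[mask]

# Toy table with one row per year value (illustration only).
toy = pd.DataFrame({"year": [1995, 2000, 2001, 2003, 2004, 2005, 2006, 2007]})
sizes = {y: len(year_window(toy, y)) for y in [2000, 2003, 2005, 2006]}
print(sizes)
```

With such a helper, the branches would reduce to something like `year_window(cite, year, column="s_year")` for citations and `year_window(col, year)` for collaborations.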

## Neighbors by year

In [41]:
neighbors = []
succ_neighbors = []
for i, year in enumerate(years[:-5]):
    if year == 2000:
        chunk = col[col.year <= year]
    elif year == 2003:
        chunk = col[(col.year > 2000) & (col.year <= 2003)]
    elif year == 2005:
        chunk = col[(col.year > 2003) & (col.year <= 2005)]
    else:
        chunk = col[col.year == year]
    G_year = nx.from_pandas_edgelist(chunk,
                                     source="author1",
                                     target="author2",
                                     create_using=nx.Graph)
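    # Collect the co-authors of every super-cited author in this window
    n = set()
    for sc in super_cited[i]:
        if sc in G_year:
            n.update(G_year[sc])
    # Exclude anyone who was super-cited in this or any earlier window
    for j in range(i + 1):
        n = n - super_cited[j]
    neighbors.append(n)
    # Neighbors who are super-cited five windows later (five calendar
    # years later for windows from 2005 onward)
    succ_neighbors.append(n & super_cited[i+5])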
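The neighbor selection can be illustrated on a toy graph. This sketch is simplified: it removes only the current window's super-cited authors, whereas the loop above also removes those from every earlier window. The graph, the sets, and the name `sc_neighbors` are all hypothetical.

```python
import networkx as nx

# Hypothetical toy data: one collaboration graph, plus super-cited sets
# for the current window and for a later window.
G_toy = nx.Graph([("sc1", "a"), ("sc1", "b"), ("sc2", "b"),
                  ("sc2", "sc1"), ("c", "d")])
sc_now = {"sc1", "sc2"}
sc_later = {"sc1", "b"}

def sc_neighbors(G, super_cited_now):
    """Co-authors of super-cited authors who are not super-cited themselves."""
    found = set()
    for sc in super_cited_now:
        if sc in G:
            found.update(G[sc])
    return found - super_cited_now

toy_neighbors = sc_neighbors(G_toy, sc_now)
# "Successful" neighbors: those who appear in the later super-cited set
toy_successful = toy_neighbors & sc_later
print(toy_neighbors, toy_successful)
```

Here `a` and `b` are neighbors of super-cited authors, and only `b` becomes super-cited later.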

# Networks by year

In [105]:
def order_authors1(x):
    # First author of the pair in alphabetical order
    return sorted(x)[0]

In [94]:
def order_authors2(x):
    # Second author of the pair in alphabetical order
    return sorted(x)[1]

In [106]:
author1 = col[['author1', 'author2']].apply(order_authors1, axis=1).values

In [107]:
author2 = col[['author1', 'author2']].apply(order_authors2, axis=1).values

In [108]:
col['author1'] = author1

In [109]:
col['author2'] = author2
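The four cells above put each pair in alphabetical order so that (a, b) and (b, a) count as the same undirected edge. A vectorized alternative, sketched here on a toy table with invented identifiers, sorts each row with `np.sort` instead of calling `apply` twice. It assumes neither author column contains missing values.

```python
import numpy as np
import pandas as pd

# Toy collaboration table (hypothetical identifiers, for illustration only).
toy_col = pd.DataFrame({
    "author1": ["pzz1", "pab2", "pcd3"],
    "author2": ["paa9", "pxy7", "pcd1"],
    "year": [2010, 2011, 2012],
})

# Sort the two identifiers within each row in a single call.
ordered = np.sort(toy_col[["author1", "author2"]].to_numpy(), axis=1)
toy_col["author1"] = ordered[:, 0]
toy_col["author2"] = ordered[:, 1]
print(toy_col)
```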

In [112]:
graphs = []
for i, year in enumerate(years[:-5]):
    # Exact-year filter (no pooled windows here, unlike the loops above)
    chunk = col[col.year == year]
    chunk = chunk.groupby(['author1', 'author2']).size().rename('weight').reset_index()
    G = nx.from_pandas_edgelist(chunk,
                                source='author1',
                                target='author2',
                                edge_attr='weight')
    graphs.append(G)

In [118]:
len(graphs[4])

4322

## Structural parameters

Induced subgraph (node and its SC neighbors):
 1. Clustering
 2. Degree
 3. Weighted degree

First neighbors:
 1. Nodes
 2. Edges
 3. Weight
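One possible reading of these parameters, sketched with `networkx` on a toy weighted graph. The interpretation is an assumption: "Nodes", "Edges" and "Weight" are taken here to mean the number of first neighbors, the number of edges in the subgraph induced by the node and its first neighbors, and the total weight of those edges. The graph, the super-cited set, and the name `structural_parameters` are hypothetical.

```python
import networkx as nx

# Hypothetical toy weighted collaboration graph and super-cited set.
G_toy = nx.Graph()
G_toy.add_weighted_edges_from([
    ("n", "sc1", 2), ("n", "sc2", 1), ("sc1", "sc2", 3),
    ("n", "x", 1), ("x", "y", 4),
])
sc_toy = {"sc1", "sc2"}

def structural_parameters(G, node, super_cited):
    """Compute the listed parameters for one node (illustrative sketch)."""
    # Induced subgraph: the node plus its super-cited neighbors
    sc_nbrs = [v for v in G[node] if v in super_cited]
    sub = G.subgraph([node] + sc_nbrs)
    # First neighbors: every direct co-author of the node
    first = list(G[node])
    ego = G.subgraph([node] + first)
    return {
        "clustering": nx.clustering(sub, node),
        "degree": sub.degree(node),
        "weighted_degree": sub.degree(node, weight="weight"),
        "n_neighbors": len(first),
        "n_edges": ego.number_of_edges(),
        "weight": ego.size(weight="weight"),
    }

params = structural_parameters(G_toy, "n", sc_toy)
print(params)
```

In the notebook's setting, `G` would be one of the yearly graphs in `graphs` and `super_cited` the matching set for that window.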