# Doing statistics


In [7]:
import pandas as pd
df = pd.read_csv("../../data/CSS_exact_openalex.csv", low_memory=False)
df.shape

(1449, 182)

Two kinds of variables:
- dependent: number of citations `cited_by_count`
- independent: year `publication_year`; language `language`; number of authors (derived from `authorships.raw_author_name`); open access `open_access.is_oa`; type `type`

## Data processing

In [9]:
# drop rows without authors (.copy() avoids SettingWithCopyWarning on the assignments below)
df = df[~df["authorships.raw_author_name"].isna()].copy()

# add the number of authors (names are separated by "|")
df["nb_auteurs"] = df["authorships.raw_author_name"].apply(lambda x: len(x.split("|")))

# flag English-language publications
df["english"] = df["language"] == "en"

# rename columns
df = df.rename(columns={"open_access.is_oa":"openaccess"})

# convert the open-access flag to numeric (0/1)
df["openaccess"] = df["openaccess"].astype(int)
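The author count above relies on names being separated by `|` in `authorships.raw_author_name`. A minimal sketch on made-up data (the names below are hypothetical) shows the same count obtained with the vectorized `.str` accessor, which should give the same result as the `apply` version once rows without authors are dropped:

```python
import pandas as pd

# Toy data: hypothetical author strings, "|"-separated as in the notebook's column
toy = pd.DataFrame({
    "authorships.raw_author_name": ["A. Martin|B. Durand|C. Petit", "D. Moreau", None]
})

# Drop rows without authors, then count the names in each string
toy = toy.dropna(subset=["authorships.raw_author_name"]).copy()
toy["nb_auteurs"] = toy["authorships.raw_author_name"].str.split("|").str.len()

print(toy["nb_auteurs"].tolist())
```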

In [11]:
df[["nb_auteurs","openaccess","english"]]

      nb_auteurs  openaccess  english
0             15           1     True
1             14           1     True
2             15           0     True
3              4           1     True
4             12           0     True
...          ...         ...      ...
1441           1           1     True
1442           1           0     True
1443           1           0     True
1445           1           0     True


## Univariate

Summary statistics

In [12]:
df["nb_auteurs"].describe()

count    1378.000000
mean        3.174165
std         4.618503
min         1.000000
25%         1.000000
50%         2.000000
75%         4.000000
max        99.000000
Name: nb_auteurs, dtype: float64

Distribution

In [14]:
df["english"].value_counts()

english
True     1312
False      66
Name: count, dtype: int64

Recoding
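As one possible recoding (the bin edges and labels below are assumptions chosen for illustration, not taken from the notebook), a skewed count such as the number of authors could be grouped into classes with `pd.cut`:

```python
import pandas as pd

# Toy author counts; in the notebook this would be df["nb_auteurs"]
nb_auteurs = pd.Series([1, 1, 2, 4, 15, 99])

# Hypothetical classes: 1 author, 2 to 4 authors, 5 or more
# Intervals are right-closed by default: (0, 1], (1, 4], (4, inf]
classes = pd.cut(
    nb_auteurs,
    bins=[0, 1, 4, float("inf")],
    labels=["1", "2-4", "5+"],
)

print(classes.value_counts(sort=False))
```

The resulting categorical column can then be used like any other qualitative variable, for instance in `groupby` or `pd.crosstab`.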

## Bivariate

Quantitative/quantitative: `.corr`

In [None]:
df["nb_auteurs"].corr(df["cited_by_count"])  # Pearson correlation by default

In [21]:
from scipy.stats import pearsonr

stat, p = pearsonr(df["nb_auteurs"],df["cited_by_count"])
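The cell above stores the result without displaying it. A minimal sketch on synthetic data (the values are made up) of how the statistic and p-value can be inspected; Spearman's rank correlation is shown alongside as an optional complement, since count variables like these tend to be heavily skewed:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)

# Synthetic stand-ins for df["nb_auteurs"] and df["cited_by_count"]
nb_auteurs = rng.integers(1, 20, size=200)
citations = nb_auteurs * 2 + rng.poisson(5, size=200)

r, p = pearsonr(nb_auteurs, citations)
rho, p_rank = spearmanr(nb_auteurs, citations)

print(f"Pearson r = {r:.3f} (p = {p:.3g})")
print(f"Spearman rho = {rho:.3f} (p = {p_rank:.3g})")
```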

Quantitative/qualitative

.groupby

In [22]:
df.groupby("openaccess")["cited_by_count"].mean()

openaccess
0    14.269118
1    32.467049
Name: cited_by_count, dtype: float64

In [26]:
from scipy.stats import f_oneway

f_oneway(df[df["openaccess"] == 1]["cited_by_count"],
         df[df["openaccess"] == 0]["cited_by_count"])

F_onewayResult(statistic=np.float64(4.2765330600487035), pvalue=np.float64(0.03882860979861467))
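With only two groups, a one-way ANOVA is equivalent to Student's t-test with equal variances: the F statistic is the square of the t statistic and the p-values coincide. A small sketch on synthetic data (not the notebook's) to illustrate:

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

rng = np.random.default_rng(42)

# Synthetic citation counts for two groups (stand-ins for openaccess == 1 and == 0)
oa = rng.poisson(30, size=100)
not_oa = rng.poisson(15, size=120)

f_stat, p_anova = f_oneway(oa, not_oa)
t_stat, p_ttest = ttest_ind(oa, not_oa)  # equal_var=True by default

print(f_stat, t_stat ** 2)  # same value up to rounding error
print(p_anova, p_ttest)
```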

Other libraries can be used as well, such as pingouin: https://pingouin-stats.org/build/html/index.html

In [None]:
#pip install pingouin

Qualitative/qualitative

In [47]:
tab = pd.crosstab(df["openaccess"], df["english"])
pd.crosstab(df["openaccess"], df["english"], normalize="columns")

english        False      True
openaccess
0           0.757576  0.480183
1           0.242424  0.519817


In [42]:
from scipy.stats import chi2_contingency

stat, p, *args = chi2_contingency(tab)
p

np.float64(1.9367915463731307e-05)
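`chi2_contingency` returns four values: the statistic, the p-value, the degrees of freedom, and the table of expected counts under independence. The `*args` above discards the last two. The sketch below unpacks all four, using counts reconstructed from the column proportions and the 66 / 1312 language totals shown earlier (the reconstruction is an assumption, not notebook output). For a 2×2 table, SciPy applies Yates' continuity correction by default.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Counts reconstructed from the proportions and totals above (assumed, not copied)
tab = pd.DataFrame(
    {False: [50, 16], True: [630, 682]},
    index=pd.Index([0, 1], name="openaccess"),
)
tab.columns.name = "english"

stat, p, dof, expected = chi2_contingency(tab)

print(f"chi2 = {stat:.2f}, dof = {dof}, p = {p:.3g}")
# Expected counts if open access and language were independent
print(pd.DataFrame(expected, index=tab.index, columns=tab.columns).round(1))
```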

## Models

### Linear regression

In [48]:
import statsmodels.api as sm


X = df[['publication_year', 'nb_auteurs']] 
X = sm.add_constant(X) 
y = df["cited_by_count"]

model = sm.OLS(y, X).fit()


print(model.summary())

                            OLS Regression Results                            
Dep. Variable:         cited_by_count   R-squared:                       0.045
Model:                            OLS   Adj. R-squared:                  0.043
Method:                 Least Squares   F-statistic:                     32.13
Date:                Fri, 13 Jun 2025   Prob (F-statistic):           2.31e-14
Time:                        11:51:52   Log-Likelihood:                -8946.8
No. Observations:                1378   AIC:                         1.790e+04
Df Residuals:                    1375   BIC:                         1.792e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const             1.467e+04   1916.031  

### Logistic regression

What predicts whether a publication has more than 10 citations?

Create a binary variable: 0 (10 citations or fewer), 1 otherwise

In [49]:
df["citations10"] = df["cited_by_count"] > 10

With scikit-learn

In [50]:
from sklearn.linear_model import LogisticRegression

In [52]:
X = pd.get_dummies(df[["publication_year", "openaccess"]], drop_first=True)
y = df["citations10"]

In [53]:
modele = LogisticRegression()
modele.fit(X, y)
modele.coef_

array([[-0.21478903,  1.30063159]])

In [55]:
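# feature order is [publication_year, openaccess]; openaccess is coded 0/1 in the data,
# so the value 10 used here lies outside the observed range
modele.predict([[2004, 10]])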



array([ True])
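Since the model was fitted on the columns `publication_year` and `openaccess` in that order, each row passed to `predict` must follow the same order. A sketch on synthetic data (the data-generating rule is made up) that passes a one-row DataFrame with named columns and also asks for predicted probabilities via `predict_proba`:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic stand-in for the notebook's features
X = pd.DataFrame({
    "publication_year": rng.integers(2000, 2024, size=300),
    "openaccess": rng.integers(0, 2, size=300),
})
# Made-up rule: older and open-access papers are more likely to pass the threshold
score = -0.2 * (X["publication_year"] - 2012) + 1.3 * X["openaccess"]
y = rng.random(300) < 1 / (1 + np.exp(-score))

modele = LogisticRegression(max_iter=1000).fit(X, y)

# One new observation, with the same column names and order as X
nouveau = pd.DataFrame({"publication_year": [2004], "openaccess": [1]})
proba = modele.predict_proba(nouveau)  # columns follow modele.classes_

print(modele.predict(nouveau))
print(proba)
```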

With statsmodels

In [57]:
import statsmodels.api as sm
from patsy import dmatrices

In [58]:
y,X = dmatrices('citations10 ~ C(openaccess) + publication_year', 
                data = df, 
                return_type = 'dataframe')

In [59]:
modele = sm.Logit(y["citations10[True]"],X)
resultat = modele.fit()
resultat.summary()

Optimization terminated successfully.
         Current function value: 0.444281
         Iterations 7


Dep. Variable:      citations10[True]   No. Observations:         1378
Model:                          Logit   Df Residuals:             1375
Method:                           MLE   Df Model:                    2
Date:                Fri, 13 Jun 2025   Pseudo R-squ.:          0.1398
Time:                        11:56:22   Log-Likelihood:        -612.22
converged:                       True   LL-Null:               -711.69
Covariance Type:            nonrobust   LLR p-value:         6.328e-44

                         coef   std err         z    P>|z|    [0.025    0.975]
Intercept            434.8080    36.241    11.998    0.000   363.777   505.839
C(openaccess)[T.1]     1.3349     0.162     8.231    0.000     1.017     1.653
publication_year      -0.2165     0.018   -12.038    0.000    -0.252    -0.181


With the all-in-one version

In [None]:
#!pip install -U pyshs

In [60]:
import pyshs

In [61]:
pyshs.regression_logistique(df, "citations10", ["openaccess", "publication_year"],)

Variable          Modalité    OR     p        IC 95%
.Intercept                    0.0    0.0***   0.00 [0.00-0.00]
openaccess        numérique   0.26   0.0***   0.26 [0.19-0.36]
publication_year  numérique   1.24   0.0***   1.24 [1.20-1.29]
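Exponentiating the statsmodels coefficients gives odds ratios, which makes the two outputs above comparable. In the sketch below (coefficients copied by hand from the Logit summary), the odds ratios come out as the reciprocals of those in the pyshs table. This suggests, though the notebook does not say so, that the two fits take opposite outcome levels as the modelled event.

```python
import numpy as np

# Coefficients copied from the statsmodels Logit summary above
coefs = {"C(openaccess)[T.1]": 1.3349, "publication_year": -0.2165}

# Odds ratio = exp(coefficient)
odds_ratios = {name: float(np.exp(beta)) for name, beta in coefs.items()}

for name, odds_ratio in odds_ratios.items():
    print(f"{name}: OR = {odds_ratio:.2f}, 1/OR = {1 / odds_ratio:.2f}")
```

Read this way, the statsmodels fit puts the odds of exceeding 10 citations at roughly 3.8 times higher for open-access publications, and about 19% lower for each additional publication year.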
