This notebook is a narrative exploration of efforts to predict the number of citations per year a paper will receive,
based on data available at time of publication.

A full exploration of 25 different prediction models can be found [here](askfhs).

In this notebook I predict the number of citations a paper will receive based on
* the words used in the abstract
* physics-inspired semantic metrics (implemented in github.com/zhafen/cc and employed in Imel & Hafen in prep)
* metadata (publishing journal, number of authors, number of pages, etc.)

The raw code for this analysis can be [found here](https://github.com/zhafen/work-sample).

I aimed to keep this work sample clean, so please reach out if you have questions about details I have not included.

# Data

I use publication abstracts and metadata pulled from the [NASA Astrophysics Data System](https://ui.adsabs.harvard.edu) via [the official API](https://ui.adsabs.harvard.edu/help/api/). The analyzed publications are from a randomly chosen physics or astrophysics specialization.

I preprocessed the abstracts outside this notebook with natural language processing (tokenizing, stemming, and removing filler words), so each abstract has a corresponding bag-of-words representation.
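To make the preprocessing steps concrete, here is a minimal sketch of how an abstract could be turned into a bag of words. It is not the pipeline used for this analysis: the stop-word list, the `crude_stem` helper, and the example sentence are all hypothetical, and the suffix stripping is only a stand-in for a real stemmer such as the Porter stemmer in nltk.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
# A real pipeline would use a fuller list, such as the one shipped with nltk.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "we", "is", "are",
              "that", "for", "with", "on"}


def crude_stem(token):
    """Strip a few common English suffixes.

    Stand-in for a real stemmer; it only illustrates the idea.
    """
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words like "gas" survive.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def bag_of_words(abstract):
    """Tokenize, drop filler words, stem, and count the remaining stems."""
    tokens = re.findall(r"[a-z]+", abstract.lower())
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(stems)


# Made-up example sentence, not taken from the dataset.
example = ("We model the accretion of gas in simulated galaxies "
           "and the galaxies accreting gas.")
bow = bag_of_words(example)
print(bow)
```

Each abstract then becomes a mapping from stem to count, which can be stacked into a document-term matrix for modeling.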

# Cross Validation

In [282]:
from sklearn.model_selection import cross_validate

In [None]:
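# Evaluate the model with k-fold cross-validation on the training set.
# scikit-learn treats larger scores as better, so the mean squared error
# is reported with its sign flipped.
cross_validate(
    estimator=model,
    X=x_t_df.values,
    y=df_train['log_citations_per_year'].values,
    cv=kfold,
    scoring='neg_mean_squared_error',
    return_estimator=True,  # keep the fitted estimator from each fold
)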

{'fit_time': array([0.0011487 , 0.00084901, 0.00049782, 0.000458  , 0.00043535]),
 'score_time': array([0.00037026, 0.00032496, 0.00028419, 0.00027323, 0.00028801]),
 'estimator': [Baseline(), Baseline(), Baseline(), Baseline(), Baseline()],
 'test_score': array([-0.42891344, -0.42412764, -0.38014943, -0.41732149, -0.44537275])}
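Because the scoring is `neg_mean_squared_error`, the `test_score` values are negated mean squared errors, so values closer to zero are better. As a rough sketch of how to read them, the snippet below copies the five fold scores from the output above, flips the sign, and summarizes them. The variable names are my own, and the resulting error is in the same units as `log_citations_per_year`.

```python
import math

# Per-fold scores copied from the cross_validate output above.
test_scores = [-0.42891344, -0.42412764, -0.38014943, -0.41732149, -0.44537275]

# Undo the sign flip to recover the mean squared error of each fold.
mse_per_fold = [-score for score in test_scores]

# Average over folds, then take the root to get an error in target units.
mean_mse = sum(mse_per_fold) / len(mse_per_fold)
rmse = math.sqrt(mean_mse)

print(f"mean MSE across folds: {mean_mse:.4f}")
print(f"RMSE from the mean MSE: {rmse:.4f}")
```

Note that this takes the root of the fold-averaged MSE, which is not identical to averaging per-fold RMSE values. Either way, the baseline score is the number the other models need to beat.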

# Credits

Python packages used include:
* [ads](https://github.com/andycasey/ads)
* nltk