In [None]:
import yaml
import json
import os

import sqlalchemy as sql
import pandas as pd
import plotly.express as px
import pm_query as pq

from Bio import Entrez

# Part 1

These are the initial parameters of the scraper module, including the email address of the team member who created the module. The function pq.secret_manager reads a YAML file containing the passwords and API keys needed to run this module, so they do not have to be hardcoded into the Python script. The search term 'HIV' is assigned to the variable 'search'.

In [None]:
keys = pq.secret_manager("apikeys.yaml")

email = "rachit.sabharwal@uth.tmc.edu"
search = "HIV"
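
The implementation of pq.secret_manager is not shown in this notebook. As a minimal sketch, assuming it simply parses the YAML file into nested dictionaries, a loader could look like the following. The function name load_secrets and the placeholder key value are hypothetical; the nesting apikeys → ncbikey → key is taken from how keys is indexed later in the notebook.

```python
import yaml  # PyYAML

# Assumed layout of apikeys.yaml, inferred from keys["apikeys"]["ncbikey"]["key"].
EXAMPLE_YAML = """
apikeys:
  ncbikey:
    key: YOUR_NCBI_API_KEY
"""


def load_secrets(path):
    """Hypothetical stand-in for pq.secret_manager: parse a YAML secrets file."""
    with open(path, "r", encoding="utf-8") as handle:
        # safe_load only builds plain Python objects (dicts, lists, strings).
        return yaml.safe_load(handle)


# Parsing the example text directly shows the resulting structure.
example_keys = yaml.safe_load(EXAMPLE_YAML)
ncbi_key = example_keys["apikeys"]["ncbikey"]["key"]
```

Keeping the secrets file out of version control (for example by listing it in .gitignore) is what makes this approach safer than hardcoding.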

# {DON'T RUN THIS}

In this section, we gather the data for the final data frame. The get_pmid function queries the eSearch endpoint of the Entrez API to retrieve the PMIDs that match the search term and date range. Using the PMIDs retrieved by get_pmid, the get_data function queries the eFetch endpoint to retrieve the details of each citation as a list of dictionaries. The gathered data is then serialized to JSON and saved as hiv_records.json.

In [None]:
hiv_pmids = pq.get_pmid(contact=email, key=keys["apikeys"]["ncbikey"]["key"], term=search, mindate="2020/01/01", maxdate="2020/09/01")

hiv_records = pq.get_data(pmid_list=hiv_pmids, contact=email, key=keys["apikeys"]["ncbikey"]["key"])

with open('hiv_records.json', 'w') as outfile:
    json.dump(hiv_records, outfile)
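
The internals of pq.get_pmid are not shown here, and the module may well use Bio.Entrez for this step. As a rough illustration of what an eSearch request for the call above contains, the sketch below builds the request URL with the standard library only and sends nothing over the network. The function name build_esearch_url and the retmax default are assumptions made for this example.

```python
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities eSearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, mindate, maxdate, contact, key, retmax=100):
    """Hypothetical helper: assemble an eSearch URL for a PubMed date-range query."""
    params = {
        "db": "pubmed",       # search the PubMed database
        "term": term,         # search term, e.g. "HIV"
        "datetype": "pdat",   # filter on publication date
        "mindate": mindate,   # YYYY/MM/DD
        "maxdate": maxdate,   # YYYY/MM/DD
        "retmax": retmax,     # maximum number of PMIDs to return (assumed value)
        "retmode": "json",
        "email": contact,     # contact address for NCBI
        "api_key": key,
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_esearch_url("HIV", "2020/01/01", "2020/09/01",
                        "user@example.org", "YOUR_NCBI_API_KEY")
```

The PMIDs returned by such a request are what get_data would then pass to eFetch.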

# {/DON'T RUN THIS}

In this section, we clean the data by running the clean_data and keep_cleaning functions. The keep_cleaning function performs additional cleaning: it resets the index of the dataframe, converts the pmid variable to an integer data type, formats the dates as %Y-%m-%d, and joins the title and abstract columns by index. Finally, the dataframe is written out in csv format.

In [None]:
# Raw string so the backslashes in the Windows path are not read as escape sequences.
with open(r'D:\Dell_Desktop\Documents\Python Projects\ph_1975_capstone_project\webapp\hiv_records.json', 'r') as infile:
    hiv_records = json.load(infile)

hiv_clean = pq.clean_data(hiv_records)
hiv_clean = pq.keep_cleaning(hiv_clean)

pq.file_downloader("hiv_csv_clean.csv", hiv_clean)
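
The body of keep_cleaning is not shown in this notebook. The sketch below illustrates the steps described above with plain pandas, under stated assumptions: the column names pmid, pubdate, title, abstract and text are hypothetical, and joining title and abstract is interpreted here as concatenating the two strings row by row, which is only one possible reading.

```python
import pandas as pd


def keep_cleaning_sketch(df):
    """Hypothetical illustration of the cleaning steps; not the module's code."""
    out = df.reset_index(drop=True)              # fresh 0..n-1 index
    out["pmid"] = out["pmid"].astype(int)        # PMIDs as integers
    # Parse the dates, then render them in %Y-%m-%d form.
    out["pubdate"] = pd.to_datetime(out["pubdate"]).dt.strftime("%Y-%m-%d")
    # Assumed interpretation: combine title and abstract row by row.
    out["text"] = out["title"].str.cat(out["abstract"], sep=" ")
    return out


raw = pd.DataFrame(
    {
        "pmid": ["101", "202"],
        "pubdate": ["2020/01/15", "2020/03/02"],
        "title": ["First title.", "Second title."],
        "abstract": ["First abstract.", "Second abstract."],
    },
    index=[5, 9],
)
clean = keep_cleaning_sketch(raw)
```

A cleaned frame like this can then be written with clean.to_csv(path, index=False).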

# Part 2

In this section, we read the csv file created by the data crawler using the pandas read_csv function. The data is then reformatted for use with sqlite and saved as a new csv file called hiv_csv.

In [None]:
hiv_csv = pq.csv_bnb("hiv_records_clean.csv")

In this section, we use the sqlite_out function to take hiv_csv and, with the create_engine function from sqlalchemy, automatically build a database from it, specifying sqlite as the database dialect. The author query creates an engine in the same way and uses the pandas read_sql function to restrict the results to records with a matching author name. Finally, we display the first rows of the query result using the head function, which returns five rows by default.

In [None]:
pq.sqlite_out(hiv_csv)

sql_df = pq.sql_author_query("Julie")
sql_df.head()
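
The code for sqlite_out and sql_author_query is not shown here. The sketch below illustrates the same pattern of writing a dataframe to SQLite and reading back a filtered result. It deliberately uses the standard-library sqlite3 connection instead of a sqlalchemy engine so that it runs without extra dependencies; the table name publications, the authors column and both function names are assumptions.

```python
import sqlite3

import pandas as pd


def build_database(df, conn, table="publications"):
    """Write the dataframe to a SQLite table, replacing any existing table."""
    df.to_sql(table, conn, if_exists="replace", index=False)


def author_query(conn, name, table="publications"):
    """Return rows whose authors field contains the given name."""
    # The name is passed as a bound parameter, not pasted into the SQL string.
    query = f"SELECT * FROM {table} WHERE authors LIKE ?"
    return pd.read_sql(query, conn, params=(f"%{name}%",))


conn = sqlite3.connect(":memory:")  # in-memory database for the example
papers = pd.DataFrame(
    {
        "pmid": [1, 2, 3],
        "authors": ["Julie Smith; Bo Lee", "John Doe", "Ann Julien"],
    }
)
build_database(papers, conn)
matches = author_query(conn, "Julie")
```

Because the pattern is a substring match, a query for "Julie" also returns authors such as "Julien".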

# Part 3

In this section, we call the function that draws the graphs. The user can display the number of publications in each month as a bar graph, visualize the publication trend over time as a line graph, or view both at once with the line graph overlaid on the bar graph.

In [None]:
pq.draw_graph(hiv_csv)
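
How draw_graph prepares its data is not shown. Both the bar graph and the line graph need one count per month, so a plausible preparation step is sketched below with pandas. The function name monthly_counts and the pubdate column are hypothetical.

```python
import pandas as pd


def monthly_counts(df, date_col="pubdate"):
    """Count publications per calendar month (hypothetical helper)."""
    months = pd.to_datetime(df[date_col]).dt.strftime("%Y-%m")
    counts = months.value_counts().sort_index()  # chronological order
    return counts.rename_axis("month").reset_index(name="publications")


sample = pd.DataFrame({"pubdate": ["2020-01-05", "2020-01-20", "2020-03-01"]})
per_month = monthly_counts(sample)
```

Months with no publications are simply absent from the result; a plot that should show them as zero would need the missing months filled in first.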

This section creates and displays the summary statistics for a selected month (here, January).

In [None]:
summary_stats = pq.summary_stats(hiv_csv, "january")
summary_stats