# Counting and Visualizing Document Totals by Year and Source within the Project

### INFO

__author__    = 'Lindsay Thomas'  
__copyright__ = 'copyright 2019, The WE1S Project'  
__license__   = 'MIT'  
__version__   = '2.0'  
__email__     = 'lindsaythomas@miami.edu'

This notebook counts the number of documents per unique source per year in the project. It is organized into numbered sections and offers two methods of counting, which you select in **Section 2**. Results may vary depending on which method you use. If you count from dfr-browser's metadata file, you must already have produced a dfr-browser for your model.

The notebook also includes a few options for saving and visualizing these count totals.

You must run every cell under the **Settings** section before any other cell, and you must do so again every time you return to this notebook.

## Settings

In [None]:
# Python imports
import os
import re
from pathlib import Path
from IPython.display import display, HTML

# Define paths
current_dir     = %pwd
current_pathobj = Path(current_dir)
project_dir     = str(current_pathobj.parent.parent)
current_reldir  = current_dir.split("/write/")[1]
data_dir        = project_dir + '/project_data'
json_dir        = project_dir + '/project_data/json'
json_test       = project_dir + '/project_data/json_test'
model_dir       = data_dir + '/models'
md_file         = project_dir + '/project_data/metadata/metadata-dfrb.csv'

# Import scripts
%run scripts/count_docs.py

# Helper function
def clean_date_range(date_range):
    """Strip all whitespace from a date range such as '2014 - 2016'."""
    return re.sub(r'\s+', '', date_range)

# Confirm that setup has finished
display(HTML('<p style="color: green;">Setup complete.</p>'))

## 1. Load an Existing Dataframe 

If you have already created and saved a dataframe using this notebook, you can load it for use in this notebook using the cells below. After loading your dataframe, you can skip to **Section 3**. If you need to create a counts dataframe, continue to **Section 2**. 

In [None]:
# Define path of saved dataframe
csv_file = ''

In [None]:
df = pd.read_csv(csv_file, index_col=0)  # the first column holds the row labels (source names)

## 2. Count Documents by Source and by Publication Year

Make a dataframe with document totals for each unique source by year and download the results. You can obtain source information from EITHER dfr-browser's metadata file (you must have produced a dfr-browser in your project in order to use this method) OR the json documents in your project. Obtaining source information from dfr-browser's metadata file is generally quicker, particularly for large collections of data.

A date marked as 'unknown' means that no publication year could be determined for the document. To find the total number of documents listed with an 'unknown' publication year, go to **Section 4**.
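As a rough illustration of what a source-by-year count table looks like, the sketch below builds one with `pandas.crosstab`. It is not the code in `scripts/count_docs.py`: the `records` list, the way the year is taken from the first four characters of the date, and the `counts` name are all assumptions made for this example.

```python
import pandas as pd

# Hypothetical records: one dict per document, using the default field
# names 'source' and 'pub_date'. These values are made up for the example.
records = [
    {'source': 'Source A', 'pub_date': '2014-03-01'},
    {'source': 'Source A', 'pub_date': '2015-07-19'},
    {'source': 'Source B', 'pub_date': '2014-11-30'},
    {'source': 'Source B', 'pub_date': ''},
]

docs = pd.DataFrame(records)

# Assumption: the year is the first four characters of the date string.
# Empty dates are relabeled 'unknown'.
docs['year'] = docs['pub_date'].str[:4].replace('', 'unknown')

# One row per source, one column per year, plus a 'Total' column.
counts = pd.crosstab(docs['source'], docs['year'])
counts['Total'] = counts.sum(axis=1)
```

The resulting table has the same general shape the later sections rely on: sources as the row index, years as string column labels, and a `Total` column.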

### Configuration

Please select the way you would like to do the counting and specify the names of the `title` and `pub_date` fields in your data. To change how you would like to do the counting, change the value of the `mode` variable to `'dfr-browser'` or `'json'`, depending on which one you want. If you have imported your data using the WE1S import notebook, you do not need to change the values for the `title_field` or `date_field` variables (although if you are using WE1S data, see the next paragraph). Otherwise, change these values to the metadata fields where you would like the code to look for your publication titles and your publication dates. The values must be enclosed in quotation marks.

**If you are working with WE1S data,** we recommend you use `title_field` = `'source'` for the most accurate count of unique sources (you can use either `'pub_date'` OR `'pub_year'` for the `date_field` variable).
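A full date field and a year-only field can give the same counts if both are reduced to a four-digit year before counting. The helper below is a hypothetical sketch of that normalization. The name `extract_year` and the fallback label `'unknown'` are assumptions for illustration, not a description of how the project scripts parse dates.

```python
import re

def extract_year(value):
    """Return the first standalone four-digit number in `value`, or 'unknown'."""
    # The lookarounds stop the pattern from matching inside a longer digit run.
    match = re.search(r'(?<!\d)(\d{4})(?!\d)', str(value or ''))
    return match.group(1) if match else 'unknown'

year_from_date = extract_year('2016-05-04T00:00:00Z')
year_from_year = extract_year(2016)
year_missing = extract_year(None)
```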

In [None]:
## mode must be `dfr-browser` or `json`
# mode = 'dfr-browser'
mode = 'json'

# Only change the below values if your title and publication dates
# are located in fields with different names in your data.
title_field = 'source'
date_field = 'pub_date'

display(HTML('<p style="color: green;">Configuration complete.</p>'))

### Create Dataframe of Counts for Sources and Dates

In [None]:
df = source_count_by_year(mode, md_file, json_dir, title_field, date_field)

display(HTML('<p style="color: green;">Dataframe created.</p>'))

### View the Dataframe

The cell below uses a <a href="https://github.com/quantopian/qgrid" target="_blank">QGrid</a> widget to display count results in a dataframe. Click a column label to sort by that column. Click it again to reverse sort. Click the filter icon to the right of the column label to apply filters (for instance, reducing the table to only documents from specific sources). You can re-order the columns by dragging the column label.

In [None]:
# sort by highest totals, print in descending order
df = df.sort_values('Total', ascending=False)

qgrid_widget = qgrid.show_grid(df, grid_options=grid_options, show_toolbar=False)

qgrid_widget

### Save the Dataframe to a CSV File

The cell below will save the version of the dataframe you see displayed in the cell above. To save the full version of the dataframe (disregarding any filtering, etc. you have done in the QGrid dataframe), skip the next cell, uncomment the code in the cell below it, and run that cell. 

Either cell will create a csv file named `source_counts_by_year.csv` in this module directory, which you can download and save to your computer for further processing and visualization (using Excel, Google Sheets, etc.).
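When a counts table is written with an index label and later read back, the index column has to be named on load so that the source names become row labels again. The sketch below shows that round trip using an in-memory buffer and made-up numbers. The `counts` and `reloaded` names are assumptions for the example.

```python
import io
import pandas as pd

# Made-up counts table with sources as the row index.
counts = pd.DataFrame(
    {'2014': [3, 1], '2015': [2, 4], 'Total': [5, 5]},
    index=['Source A', 'Source B'],
)

# Write to an in-memory buffer here; pass a filename instead to save to disk.
buffer = io.StringIO()
counts.to_csv(buffer, index_label='Sources')
buffer.seek(0)

# index_col='Sources' restores the source names as the row index.
reloaded = pd.read_csv(buffer, index_col='Sources')
```

Without `index_col`, the source names would load as an ordinary column, and label-based lookups such as `df.loc[source, 'Total']` would raise a `KeyError`.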

In [None]:
# Save version of dataframe you see above to csv
changed_df = qgrid_widget.get_changed_df()

changed_df.to_csv('source_counts_by_year.csv', index_label = 'Sources')
display(HTML('<p style="color: green;">CSV file saved.</p>'))

In [None]:
## Save original dataframe to csv, disregarding any changes you made in qgrid

# df.to_csv('source_counts_by_year.csv', index_label = 'Sources')
# display(HTML('<p style="color: green;">CSV file saved.</p>'))

## 3. Further Explore and Visualize Results

You must have completed **Section 1** or **Section 2** to run the cells in this section.

### Create a Dataframe of the Top 10 Sources for Selected Years or a Range of Years

In the cell below, configure a list of `years` in the form `['2014', '2015', '2017']`, or you can provide a date range by setting `date_range` to a hyphen-separated range like `'2014-2017'`. If you wish to use a list of years, rather than a date range, set `date_range = None`.

Run the second cell below to view the dataframe. 
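The two ways of choosing years map onto two kinds of pandas column selection: a label slice for a range and a list of labels for selected years. The sketch below wraps both in one hypothetical helper, `select_year_columns`, with made-up data. It assumes the year columns are strings and appear in chronological order.

```python
import re
import pandas as pd

def select_year_columns(df, years=None, date_range=None):
    """Return the columns of `df` for a list of years or a 'start-end' range."""
    if date_range:
        start, end = re.sub(r'\s+', '', date_range).split('-')
        # Label-based slicing includes both endpoints.
        return df.loc[:, start:end]
    return df.loc[:, years]

# Made-up counts table for the example.
counts = pd.DataFrame(
    {'2014': [3, 1], '2015': [2, 4], '2016': [0, 6], '2017': [1, 1]},
    index=['Source A', 'Source B'],
)

by_range = select_year_columns(counts, date_range='2014 - 2016')
by_list = select_year_columns(counts, years=['2014', '2017'])
```

A label slice fails with a `KeyError` if either endpoint is missing from unsorted columns, so the start and end years of a range should both exist in the dataframe.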

In [None]:
# Configure selected years or a range of years
# years = ['2014', '2015', '2017']
date_range = '2014-2016' # Change to None if not using

display(HTML('<p style="color: green;">Dataframe configuration complete.</p>'))

In [None]:
# View the dataframe (the ten sources with the highest totals)
df_top10 = df.sort_values('Total', ascending=False).iloc[:10]

if date_range:
    date_range = clean_date_range(date_range)
    df_top10 = df_top10.loc[:,date_range.split('-')[0]:date_range.split('-')[1]] 
else:
    df_top10 = df_top10.loc[:,years]

df_top10

### Plot the Dataframe Using Subplots (Optional)

This cell will display a separate plot for each unique source. To save the figure for downloading, uncomment the line `plt.savefig('top10.png')` (you can change the name of the output file if you like).

In [None]:
# Plot the above dataframe using subplots
%matplotlib inline
import matplotlib.pyplot as plt

# One empty title per subplot, so the cell also works with fewer than ten sources
df_top10.transpose().plot(kind='bar', figsize=(15,10), subplots=True, title=[''] * len(df_top10))

# Save the plot
# plt.savefig('top10.png')

### Get the Number of Documents for a Source and the Number of Documents/Year for That Source

In the cell below, provide the name of a publication source. The `source` value should be one of the sources listed in the first column of the dataframe you loaded in **Section 1** or created in **Section 2**.

Configure a list of `specific_years` in the form `['2014', '2015', '2017']`, or provide a date range by setting `date_range` to a hyphen-separated range like `'2014-2017'`. If you wish to use a list of years rather than a date range, set `date_range = None`.

Run the second cell below to create the dataframe and view the total number of documents found for the specified source within the specified dates.

In [None]:
# Configure source and either selected years or a range of years
source = ''
# specific_years = ['2014','2016']
date_range = '2014-2016' # Use something like '2014-2016' or set to None if not required

display(HTML('<p style="color: green;">Dataframe configuration complete.</p>'))

In [None]:
# View dataframe
try:
    df_total = df.loc[source, 'Total']
    display(HTML('<p>Number of Documents for <code>' + source + '</code>: ' + str(df_total) + '</p>'))
    if date_range:
        date_range = clean_date_range(date_range)
        start_year, end_year = date_range.split('-')
        df_singlesource = df.loc[source, start_year:end_year]
    else:
        df_singlesource = df.loc[source, specific_years]
    # Call display() explicitly: a bare expression inside `try` is not rendered
    display(pd.DataFrame(df_singlesource))
except KeyError:
    display(HTML('<p style="color:#FF0000;">That source title does not exist in the dataframe. Check the <code>title_field</code> variable above.</p>'))

### Plot the Number of Documents/Year for a Source (Optional)

This cell will display a plot of the total number of documents per year for a given source. To save the figure for downloading, uncomment the line `plt.savefig('singlesource.png')` (you can change the name of the output file if you like).

In [None]:
# Plot the dataframe
%matplotlib inline
import matplotlib.pyplot as plt

plot_title = 'Plot Title'
# year1 = 
# year2 = 

df_singlesource.transpose().plot(kind='bar', color=(0.2, 0.4, 0.6, 0.6), title=plot_title)

# Save the plot
# plt.savefig('singlesource.png')

### Count the Total Number of Documents for a Given Year or Years


In the cell below, configure a list of `specific_years` in the form `['2014', '2015', '2017']`, or provide a date range by setting `date_range` to a hyphen-separated range like `'2014-2017'`. If you wish to use a list of years rather than a date range, set `date_range = None`.

Run the second cell below to view the dataframe. 

In [None]:
# Configure selected years or a range of years
# specific_years = ['2014']
date_range = '2014-2016' # Use something like '2014-2016' or set to None if not required

display(HTML('<p style="color: green;">Dataframe configuration complete.</p>'))

In [None]:
# View the dataframe
if date_range:
    date_range = clean_date_range(date_range)
    year_total = df.loc[:,date_range.split('-')[0]:date_range.split('-')[1]].sum()
else:
    year_total = df.loc[:,specific_years].sum()

pd.DataFrame(year_total, columns=['Total'])

## 4. Count Total Number of Documents without a Publication Date

### Count Number of Documents without a Publication Date

This cell counts the number of documents without a publication date that appear in the dataframe loaded in **Section 1** or created in **Section 2**.
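The same check can be written as a loop over the column labels that stand in for a missing date. The sketch below uses made-up numbers. The `placeholder_labels` list mirrors the three labels this section looks for, and the `counts` and `undated` names are assumptions for the example.

```python
import pandas as pd

# Made-up counts table; it has no 'NaN' column.
counts = pd.DataFrame(
    {'1900': [1, 0], '2014': [3, 1], 'unknown': [2, 5]},
    index=['Source A', 'Source B'],
)

# Column labels that stand in for a missing publication date.
placeholder_labels = ['1900', 'unknown', 'NaN']

# Sum a column if it exists; otherwise report zero.
undated = {
    label: int(counts[label].sum()) if label in counts.columns else 0
    for label in placeholder_labels
}
```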

In [None]:
# Display documents without a publication date
try:
    count1 = df['1900'].sum()
except KeyError as err:
    count1 = 0
try:
    count2 = df['unknown'].sum()
except KeyError as err:
    count2 = 0
try:
    count3 = df['NaN'].sum()
except KeyError as err:
    count3 = 0

display(HTML('<p>' + str(count1) + ' dates listed as <code>1900</code></p>'))
display(HTML('<p>' + str(count2) + ' dates listed as <code>unknown</code></p>'))
display(HTML('<p>' + str(count3) + ' dates listed as <code>NaN</code></p>'))

## 5. Count Documents by Metadata Field

You must have run all of the cells under **Settings** to run the code in this section.

The code below counts the number of documents associated with a specific metadata field in your project. You must provide the field you want to count. The `json_utilities` module includes methods for retrieving lists of all of the metadata fields in your files. You can count the total number of documents with any value for a certain field, or you can count the number of documents that have a certain value within a certain field. The examples given below apply to WE1S data.

### Count the Number of Documents Associated with All Values for a Specific Field

This cell allows you to count the total number of documents associated with all values for a specific field: for example, all of the documents associated with each value in the `tags` field. Configure the `field` variable in the cell below and then run the following cell.
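To make the idea concrete, the sketch below counts documents per field value with `collections.Counter`, tracking files that fail to parse and files that lack the field. It is a hypothetical stand-in, not the `docs_by_field` function from `scripts/count_docs.py`. The name `count_field_values`, its return shape, and the choice to count each value once per document are assumptions.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

def count_field_values(json_dir, field):
    """Count documents per value of `field` across the JSON files in `json_dir`."""
    counts, bad_jsons, no_field = Counter(), [], []
    for path in sorted(Path(json_dir).glob('*.json')):
        try:
            doc = json.loads(path.read_text(encoding='utf-8'))
        except (json.JSONDecodeError, UnicodeDecodeError):
            bad_jsons.append(path.name)
            continue
        if not isinstance(doc, dict) or field not in doc:
            no_field.append(path.name)
            continue
        values = doc[field] if isinstance(doc[field], list) else [doc[field]]
        # Count each distinct value once per document.
        counts.update(set(values))
    return counts, bad_jsons, no_field

# Small demonstration with temporary files and made-up tags.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'a.json').write_text(json.dumps({'tags': ['x', 'y']}))
    Path(tmp, 'b.json').write_text(json.dumps({'tags': ['x']}))
    Path(tmp, 'c.json').write_text(json.dumps({'title': 'no tags'}))
    Path(tmp, 'd.json').write_text('{not valid json')
    tag_counts, bad, missing = count_field_values(tmp, 'tags')
```

Returning the unreadable and field-less files alongside the counts makes it possible to judge how much of the collection the totals actually cover.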

In [None]:
# Enter the name of the field you want to count here
field = 'tags'

display(HTML('<p style="color: green;">Field configured.</p>'))

In [None]:
# Note: this replaces any `df` of source counts from Section 1 or Section 2
bad_jsons, no_field, df = docs_by_field(json_dir, field)
warnings = []
if len(bad_jsons) > 0:
    warnings.append(str(len(bad_jsons)) + ' documents failed to load correctly and were not included in the count totals.')
if len(no_field) > 0:
    warnings.append(str(len(no_field)) + ' documents do not contain the selected field.')
for msg in warnings:
    display(HTML('<p style="color: red;">' + msg + ' If the number is large, this will significantly affect your results.</p>'))
display(HTML('<p style="color: green;">Dataframe created. Run the cell below to view it.</p>'))

### View the Dataframe

The cell below uses a <a href="https://github.com/quantopian/qgrid" target="_blank">QGrid</a> widget to display count results in a dataframe. Click a column label to sort by that column. Click it again to reverse sort. Click the filter icon to the right of the column label to apply filters (for instance, reducing the table to only documents with a particular value in your chosen metadata field). You can re-order the columns by dragging the column label.

In [None]:
qgrid_widget = qgrid.show_grid(df, grid_options=grid_options, show_toolbar=False)
qgrid_widget

### Save the Dataframe to a CSV File

The cell below will save the version of the dataframe you see displayed in the cell above. To save the full version of the dataframe (disregarding any filtering, etc. you have done in the QGrid dataframe), skip the next cell, uncomment the code in the cell below it, and run that cell.

Either cell will create a csv file named `YOURFIELD_counts.csv` in this module directory, which you can download and save to your computer for further processing and visualization (using Excel, Google Sheets, etc.).

In [None]:
# Save dataframe to csv
changed_df = qgrid_widget.get_changed_df()

csv_file = field + '_counts.csv'

changed_df.to_csv(csv_file, index_label = 'Index')
display(HTML('<p style="color: green;">Dataframe saved as <code>' + csv_file + '</code>.</p>'))

In [None]:
## Save original dataframe to csv, disregarding any changes you made in qgrid

# csv_file = field + '_counts.csv'

# df.to_csv(csv_file, index_label = 'Index')
# display(HTML('<p style="color: green;">Dataframe saved as <code>' + csv_file + '</code>.</p>'))

### Count the Number of Documents with a Specific Value for a Specific Field

This cell counts the number of documents with a specific value for a specific field, for example, the number of documents tagged with 'education/funding/US private college' in your project. Configure the `field` and `target_value` variables below and then run the following cell to view the counts.
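A field such as `tags` may hold either a single value or a list of values, so a match test has to handle both. The sketch below shows one way to do that on in-memory documents. The helper `has_value`, the sample `docs`, and the tag `'other/tag'` are hypothetical and do not describe how `specific_value_count` is implemented.

```python
def has_value(doc, field, target_value):
    """Return True if `doc[field]` equals, or is a list containing, `target_value`."""
    value = doc.get(field)
    if isinstance(value, list):
        return target_value in value
    return value == target_value

target = 'education/funding/US private college'

# Made-up documents: a list-valued field, a non-matching list,
# a single string value, and a document without the field.
docs = [
    {'tags': [target, 'other/tag']},
    {'tags': ['other/tag']},
    {'tags': target},
    {'title': 'no tags'},
]

# True counts as 1 when summed, so this yields the number of matching documents.
value_count = sum(has_value(doc, 'tags', target) for doc in docs)
```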

In [None]:
# Configure the field and target value
field = 'tags'
target_value = 'education/funding/US private college'

display(HTML('<p style="color: green;">Field and target value configured.</p>'))

In [None]:
# View the counts
value_count = specific_value_count(json_dir, field, target_value)

display(HTML('<p>' + str(value_count) + ' documents with <code>' + target_value + '</code> value in project.</p>'))

### Calculate the Proportion of Documents with a Specific Value for a Specific Field

You can understand the number produced by the cell above in the context of the rest of the project by running the following code, which calculates the proportion of documents in your project that have your target value for your target field.

In [None]:
json_dir_files = [file for file in os.listdir(json_dir) if file.endswith('.json')]
json_length = len(json_dir_files)

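num_docs = str(json_length)
num_matches = str(value_count)
# Guard against an empty directory, and round the percentage for display
proportion = value_count / json_length if json_length else 0
proportion = '{:.2f}%'.format(proportion * 100)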

out = num_matches + ' of ' + num_docs + ' documents (' + proportion + ') have the value <code>' + target_value + '</code> '
out += 'for the field <code>' + field + '</code>.'
display(HTML('<p>' + out + '</p>'))