# 1. IMPORT DATA

### This notebook imports selected data for use in the project. You can optionally filter, randomly sample, and de-duplicate the data. Data must be in the required format: a zipped folder of JSON files with the required we1s metadata fields. If you have data in plain-text form, or HTML files of ProQuest search results, begin with the appropriate `aux_` notebook for your data type.

## SETTINGS

In [None]:
# import modules
import csv
import glob
import os
import shutil
import re

# import global project settings from config.py
from settings import *

# set jupyter_root and project directory
jupyter_root = "/home/jovyan"
project_dir = %pwd
print(project_dir)

## BROWSE: search zip filenames for keywords

Choose `search_text` to filter available data files. If you are searching for a specific word or phrase, enter it WITHIN the single quotes below. Note that you will be searching the filenames of the data zip folders stored on harbor (usually in the `data` directory). If you want to simply list all of the files in a specific data directory, change the value of the `search_text` variable below to `None` WITHOUT single quotes (so the line should read `search_text=None`).
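
The matching rule used by the search cell is a plain, case-sensitive substring test on each zip filename. The sketch below restates that rule in isolation. The helper name `matches` and the sample filenames are made up for illustration and are not part of the notebook.

```python
def matches(filename, search_text):
    """Return True if a filename would be listed for this search_text."""
    # only zip files are ever listed
    if not filename.endswith('.zip'):
        return False
    # None (or an empty string) means "list every zip file"
    if not search_text:
        return True
    # otherwise a case-sensitive substring test on the filename
    return search_text in filename

# illustrative filenames only
names = ['6742_thenewyorktimes_1980.zip', '300814_theforward_2017.zip', 'notes.txt']
print([n for n in names if matches(n, 'forward')])
print([n for n in names if matches(n, None)])
```

Because the test is case-sensitive, searching for `'Forward'` would not match `theforward` in this sketch.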

In [None]:
search_text='search-text-here'

Run the cell and review the results. The default is to search through the `data/data-new/` directory. 
If your data is in a different location on harbor, change the `data_directory` variable to the directory you want to search, making sure to KEEP the slash at the end of the directory name.
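
The trailing slash matters because the cell below derives each relative path by splitting on the directory string. As a hedged alternative, the sketch below uses `os.path.relpath`, which gives the same relative path with or without the trailing slash. The function name `list_zips` is hypothetical, and the demo runs against a throwaway temporary directory rather than harbor.

```python
import os
import tempfile

def list_zips(root):
    """Yield zip paths relative to root, with or without a trailing slash on root."""
    for dirname, _dirs, files in os.walk(root):
        for filename in sorted(files):
            if filename.endswith('.zip'):
                yield os.path.relpath(os.path.join(dirname, filename), root)

# demo on a throwaway directory tree
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'sub'))
    open(os.path.join(root, 'sub', 'a.zip'), 'w').close()
    with_slash = list(list_zips(root + '/'))
    without_slash = list(list_zips(root))

print(with_slash == without_slash)
```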

In [None]:
import os
data_directory = 'data/data-new/'
filespath = jupyter_root + '/' + data_directory
print("datafile_list = [")
# an empty or None search_text lists every zip file
for (dirname, _dirs, files) in os.walk(filespath):
    for filename in files:
        if filename.endswith('.zip') and (not search_text or search_text in filename):
            filepath = os.path.join(dirname.split(filespath)[1], filename)
            print("    '" + filepath + "',")
print("                 ]")

## LIST: define which zips will be used to import JSON files

To import only the zip files you found above, copy the entire output of the cell above and use it to replace the `datafile_list` assignment in the following cell. Each filename should be surrounded by single quotes and followed by a comma (the comma after the last filename is optional). Then run the cell.
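
Before importing, it can help to confirm that every listed zip actually exists. The sketch below is one way to do that. The helper `missing_files` is hypothetical, and the demo uses a temporary directory in place of the real data directory.

```python
import os
import tempfile

def missing_files(datafile_list, filespath):
    """Return the entries of datafile_list that are not files under filespath."""
    return [name for name in datafile_list
            if not os.path.isfile(os.path.join(filespath, name))]

# demo: one file present, one absent
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, 'present.zip'), 'w').close()
    demo_missing = missing_files(['present.zip', 'absent.zip'], demo_dir)

print(demo_missing)
```

In the notebook, this could be called as `missing_files(datafile_list, filespath)` once both variables are defined.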

In [None]:
datafile_list = ['164282_deseretmorningnewssaltlakecity_bodypluralhumanitiesorhleadpluralhumanities_2017-01-01_2017-12-31.zip',
'6742_thenewyorktimes_bodypluralhumanitiesorhleadpluralhumanities_1980-01-01_1980-12-31.zip',
'164282_deseretmorningnews_bodypluralhumanitiesorhleadpluralhumanities_2017-01-01_2017-12-31.zip',
'300814_theforward_bodypluralhumanitiesorhleadpluralhumanities_2017-01-01_2017-12-31.zip',
'438278_thefreepressfernie_bodypluralhumanitiesorhleadpluralhumanities_2017-01-01_2017-12-31.zip']


## IMPORT: copy JSON from zip files to project cache

JSON files will be stored in the `caches/json` directory of the project. The original zip source data remains untouched. Remember to define your `data_directory` again below.
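
The import cell shells out to `unzip`. For readers who prefer to stay in Python, the sketch below does roughly the same job with the standard `zipfile` module: it copies every `.json` member into one flat directory and ignores the folder structure inside the zip. The function name `extract_json_flat` is hypothetical, and unlike `unzip -u` this sketch always overwrites existing files.

```python
import os
import tempfile
import zipfile

def extract_json_flat(zip_path, dest_dir):
    """Copy every .json member of zip_path into dest_dir, dropping folder paths."""
    os.makedirs(dest_dir, exist_ok=True)
    count = 0
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.infolist():
            name = os.path.basename(member.filename)
            if member.is_dir() or not name.endswith('.json'):
                continue
            with zf.open(member) as src, open(os.path.join(dest_dir, name), 'wb') as dst:
                dst.write(src.read())
            count += 1
    return count

# demo with a small zip built on the fly
with tempfile.TemporaryDirectory() as tmp:
    zpath = os.path.join(tmp, 'demo.zip')
    with zipfile.ZipFile(zpath, 'w') as zf:
        zf.writestr('folder/a.json', '{"content": "x"}')
        zf.writestr('folder/readme.txt', 'skip me')
    n_copied = extract_json_flat(zpath, os.path.join(tmp, 'json'))
    extracted = sorted(os.listdir(os.path.join(tmp, 'json')))

print(n_copied, extracted)
```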

In [None]:
%%time 

data_directory = 'data/data-new'
filespath = jupyter_root + '/' + data_directory

!rm -rf caches/json
!mkdir -p caches/json

for datafile in datafile_list:
    datapath = filespath + '/' + datafile
    !unzip -j -o -u "{datapath}" "*.json" -d caches/json > /dev/null

!ls caches/json | wc -l
    
print('\n\n----------Time----------')

## FILTER: delete non-matching JSON

If you want to filter out any articles that do not contain a required keyword or phrase (e.g. 'humanities'), enter it between the single quotes in the cell below:

In [None]:
required_phrase = ''

Run the filter to delete JSON files that do not match. If no filter is defined, this step will be skipped.
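
The filter cell passes `required_phrase` to `re.search` with `re.IGNORECASE`, so the phrase is interpreted as a regular expression: characters such as `.` or `(` have special meaning. The sketch below shows the difference between a literal match and a regex match. The helper `contains_phrase` and its `literal` flag are hypothetical and not part of the notebook.

```python
import re

def contains_phrase(text, phrase, literal=True):
    """Case-insensitive search; escape the phrase when it should match literally."""
    pattern = re.escape(phrase) if literal else phrase
    return re.search(pattern, text, re.IGNORECASE) is not None

print(contains_phrase('The Humanities matter', 'humanities'))
print(contains_phrase('abc', 'a.c'))                 # literal: the dot must be a dot
print(contains_phrase('abc', 'a.c', literal=False))  # regex: the dot matches any character
```

For plain words such as 'humanities', both modes behave the same.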

In [None]:
%%time

import os, re, json

if required_phrase:
    
    json_directory = 'caches/json/'
    sorted_json = sorted(f for f in os.listdir(json_directory) if f.endswith(".json"))

    del_count = 0
    for filename in sorted_json:
        fpath = os.path.join(json_directory, filename)
        with open(fpath) as f:
            json_decoded = json.load(f)
        json_content = json_decoded['content']
        # required_phrase is applied as a case-insensitive regular expression
        if not re.search(required_phrase, json_content, re.IGNORECASE):
            # delete only after the file handle is closed
            os.remove(fpath)
            del_count += 1
            ## progress indicator
            if del_count % 10 == 0:
                print('. ', end='')
    new_num_docs = len(os.listdir(json_directory))
    print('Number of documents deleted: ' + str(del_count))
    print('Number of documents containing "' + required_phrase + '": ' + str(new_num_docs))
else:
    print('No required phrase, no documents deleted.')


print('\n\n----------Time----------')

## RANDOM SAMPLE: Select x number of articles to analyze from imported files

Use the following cells if you want to randomly sample the total number of json files imported (and/or filtered) so that you only select a sample for analysis. Set the `selection` variable to the number of articles you want to randomly sample. Skip the cells in this section if you want to include all of the imported articles in your analysis.

If your data contains duplicates and you sample before detecting and deleting them (see below), some duplicates may end up in your random sample. You can delete or keep them in the cells under the "DE-DUPLICATE" heading, but deleting them removes articles from your sample, so the sample will end up smaller than `selection`. To avoid this, run the "DE-DUPLICATE" cells *before* the cells under "RANDOM SAMPLE". Note that de-duplication will then run against your entire imported dataset, which can take a while if you have imported many articles.
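
Two practical details of sampling are worth a sketch. `random.sample` raises a `ValueError` if `selection` is larger than the number of files, and without a seed the sample changes on every run. The helper `draw_sample` below is hypothetical. It caps the sample size at the number of available files and accepts an optional seed for repeatable results.

```python
import random

def draw_sample(names, selection, seed=None):
    """Sample up to `selection` names; a fixed seed gives a repeatable sample."""
    rng = random.Random(seed)
    k = min(selection, len(names))
    # sort first so the result depends only on the seed, not on listing order
    return rng.sample(sorted(names), k)

files = ['a.json', 'b.json', 'c.json', 'd.json']
print(len(draw_sample(files, 1200)))   # capped at the number of files
print(draw_sample(files, 2, seed=1) == draw_sample(files, 2, seed=1))
```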

In [None]:
import random
selection = 1200  # number of articles to randomly sample; do not use commas in the number
# selection must not exceed the number of JSON files in caches/json, or random.sample raises ValueError

json_list = os.listdir("caches/json")
sample = random.sample(json_list, selection)

# preview sample
print(sample[:10])

Copy the randomly selected files to the `caches/json_sample` directory.

In [None]:
import pyfastcopy

!rm -rf caches/json_sample
!mkdir -p caches/json_sample

# copy (rather than move) each sampled file so caches/json stays intact until the next cell
for item in sample:
    filepath = os.path.join("caches/json", item)
    shutil.copy(filepath, 'caches/json_sample/')
    
!ls caches/json_sample | wc -l

Move the contents of `caches/json_sample` into a new, empty `caches/json` directory, replacing the full set of imported files with the sample.

In [None]:
!rm -r caches/json
!mkdir caches/json
!mv caches/json_sample/* caches/json
!rm -r caches/json_sample
!ls caches/json | wc -l

### If you are dealing with data COLLECTED from LexisNexis AFTER 2.10.19, you do not need to run SCRUB.

## SCRUB: add scrubbed content to JSON

Scrubbing is performed on each article JSON file, and the results are stored in a new key in the JSON file.

-  To perform scrubbing, set `do_scrub` to `True`.
-  If an article is already scrubbed, it will be skipped unless `do_scrub_rescrub` is `True`.
-  To reduce the JSON cache size, set `do_scrub_delete_original_content` to `True`. If the original content is deleted, scrubbing cannot be repeated without re-importing the JSON from the zip files above.
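
How the three settings interact can be sketched on a single in-memory document. This is an illustration of the logic in the scrub cell, not the project's scrubber: `apply_scrub` is a hypothetical helper, and `str.strip` stands in for the real `scrub` function, whose behavior is not shown in this notebook.

```python
def apply_scrub(doc, scrub, rescrub=False, delete_original=False):
    """Scrub doc['content'] in place; return True if the document changed."""
    changed = False
    # 'content-unscrubbed' doubles as the "already scrubbed" marker
    if 'content' in doc and ('content-unscrubbed' not in doc or rescrub):
        doc['content-unscrubbed'] = doc['content']
        doc['content'] = scrub(doc['content'])
        changed = True
    # optionally drop the original text to save space
    if delete_original and 'content-unscrubbed' in doc and 'content' in doc:
        doc.pop('content-unscrubbed')
        changed = True
    return changed

doc = {'content': '  Some article text.  '}
first = apply_scrub(doc, scrub=str.strip)    # scrubs and keeps the original
second = apply_scrub(doc, scrub=str.strip)   # skipped: already scrubbed

doc2 = {'content': '  Some article text.  '}
third = apply_scrub(doc2, scrub=str.strip, delete_original=True)
fourth = apply_scrub(doc2, scrub=str.strip, delete_original=True)
print(first, second, third, fourth)
```

In this sketch, deleting the original also removes the marker key, so a second run scrubs the already-scrubbed text again. The notebook cell appears to follow the same logic.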

In [None]:
do_scrub = True
do_scrub_rescrub = False
do_scrub_delete_original_content = True 

Run to scrub.

In [None]:
%%time

import json
from scripts.scrub.scrub import scrub

if do_scrub:

    json_directory = 'caches/json/'
    sorted_json = sorted(f for f in os.listdir(json_directory) if f.endswith(".json"))

    scrub_count = 0
    for filename in sorted_json:
        fpath = os.path.join(json_directory, filename)
        scrub_changed = False
        with open(fpath) as f:
            json_decoded = json.load(f)
        # scrub if there is content and it has not been scrubbed yet (or a rescrub is forced)
        if 'content' in json_decoded and ('content-unscrubbed' not in json_decoded or do_scrub_rescrub):
            # keep the pre-scrub text under 'content-unscrubbed' and overwrite 'content'
            json_decoded['content-unscrubbed'] = json_decoded['content']
            json_decoded['content'] = scrub(json_decoded['content'])
            scrub_changed = True
        # optionally drop the pre-scrub text to reduce the cache size
        if do_scrub_delete_original_content and 'content-unscrubbed' in json_decoded and 'content' in json_decoded:
            json_decoded.pop('content-unscrubbed', None)
            scrub_changed = True
        if scrub_changed:
            with open(fpath, 'w') as json_file:
                json.dump(json_decoded, json_file)
            scrub_count += 1
            ## progress indicator
            if(scrub_count%100==0):
                print('. ', end='')
    print('Scrubbed ' + str(scrub_count) + ' files.')
else:
    print('Skipping scrub.')

print('\n\n----------Time----------')

## DE-DUPLICATE

Run the following cells if you want to detect duplicates in imported data and delete them. Right now, de-duplication fails on collections of data greater than ~22,000 articles.
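
The de-duplication cell below passes `--threshold 0.8` to the comparison script. To give a rough intuition for what a similarity threshold means, the sketch below computes cosine similarity over simple word counts using only the standard library. This is a swapped-in illustration, not the method used by the project's script, whose internals are not shown in this notebook. The function name and example sentences are made up.

```python
import math
import re
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts based on lowercase word counts."""
    a = Counter(re.findall(r'\w+', text_a.lower()))
    b = Counter(re.findall(r'\w+', text_b.lower()))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

original = 'The humanities matter to public life'
near_copy = 'The humanities matter to public life today'
unrelated = 'Stock prices fell sharply'
print(cosine_similarity(original, near_copy) >= 0.8)
print(cosine_similarity(original, unrelated) >= 0.8)
```

Under a scheme like this, pairs scoring at or above the threshold would be flagged as likely duplicates.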

In [None]:
do_dedupe = True

In [None]:
## DE-DUPLICATE

## For help on script options:
## %run scripts/deduplicate/corpus_compare.py -h 

if do_dedupe:

    print(project_dir)
    print(dedup_dir)
    print(dedup_name)
    
    ## delete previous results
    !rm -f {dedup_dir}/{dedup_output}.csv
    !rm -f {dedup_dir}/{dedup_output}.log
    !rm -f {dedup_output}.log

    %run {dedup_dir}/{dedup} -i caches/json/ -f *.json --threshold 0.8 -o {dedup_dir}/{dedup_name}.csv -l {dedup_dir}/{dedup_name}.log

## --------------
## FOR DockerFile
## --------------
## relies on sklearn
## need to pip install or pip2 install or conda install scikit-learn?

else:
    print('Skipping de-duplicate')



Delete detected duplicates. Do not run if you don't want to delete duplicates.
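
Because deletion cannot be undone without re-importing, a dry run can be useful first. The sketch below reads a results CSV and reports which listed files exist and which are already missing, without deleting anything. It assumes, as the cell below does, that the file path is in column index 5 and that the first row is a header. The helper `plan_deletions`, the header names, and the demo CSV are hypothetical.

```python
import csv
import os
import tempfile

def plan_deletions(csv_path, path_column=5):
    """Return (existing, missing) file paths listed in the CSV, without deleting."""
    existing, missing, seen = [], [], set()
    with open(csv_path, newline='') as fin:
        reader = csv.reader(fin)
        next(reader, None)  # skip header
        for row in reader:
            path = row[path_column]
            if path in seen:  # report each path once
                continue
            seen.add(path)
            (existing if os.path.isfile(path) else missing).append(path)
    return existing, missing

# demo with a throwaway CSV: one listed file exists, one does not
with tempfile.TemporaryDirectory() as tmp:
    kept = os.path.join(tmp, 'dup.json')
    open(kept, 'w').close()
    gone = os.path.join(tmp, 'already_deleted.json')
    csv_path = os.path.join(tmp, 'dedup.csv')
    with open(csv_path, 'w', newline='') as fout:
        writer = csv.writer(fout)
        writer.writerow(['c0', 'c1', 'c2', 'c3', 'c4', 'path'])
        writer.writerow(['', '', '', '', '', kept])
        writer.writerow(['', '', '', '', '', gone])
    existing, missing = plan_deletions(csv_path)
    n_existing, n_missing = len(existing), len(missing)

print(n_existing, n_missing)
```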

In [None]:
## DELETE DUPLICATES
import os
import csv

csv.field_size_limit(100000000)

if do_dedupe:
    with open(project_dir + '/' + dedup_dir + '/' + dedup_name + '.csv','r') as fin:
        cfin = csv.reader(fin)
        # print(cfin, None)
        next(cfin) # skip header
        for row in cfin:
            if os.path.isfile(row[5]):
                print('Deleting: ' + row[5])
                os.remove(row[5])
            else:
                print('Missing:  '+ row[5])
    print('\n-----\nDeleted duplicates listed in:', dedup_dir + '/' + dedup_name + '.csv')

else:
    print('Skipping de-duplicate')

## NEXT NOTEBOOK

In [None]:
write_project_dir = project_dir.replace(jupyter_root + '/', '')
next_link = 'http://harbor.english.ucsb.edu:10000/notebooks/' + write_project_dir + '/2_topic_model_data.ipynb'

from IPython.display import display, HTML
next_link_html = HTML('<h2>Next:</h2><p>Go to <a href="' + next_link + '" target="_blank"><strong>Notebook 2</strong></a> to model your data.</p>')
display(next_link_html)