<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Nathan Kelber](http://nkelber.com) and Ted Lawless for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
Modified by [Tyrica Terry Kapral](mailto:tyt3@pitt.edu).<br />
___

**Working with Dataset Files**

**Description:** This notebook describes how to:
* Read and write files (.txt, .csv, .json)
* Use the `tdm_client` to read in metadata
* Use the `tdm_client` to read in data

This notebook describes how to read and write text, CSV, and JSON files using Python. Additionally, it explains how the `tdm_client` can help users load and analyze their datasets.

**Difficulty:** Intermediate

**Knowledge Required:** 
* Python Basics ([Start Python Basics I](./python-basics-1.ipynb))

**Knowledge Recommended:** None

**Data Format:** Text (.txt), CSV (.csv), JSON (.json), JSON Lines (.jsonl)

**Libraries/Packages Used:**
* `pandas` to read and write CSV files
* `json` to read and write JSON files
* `tdm_client` to retrieve and read data
* `datetime`
* `jsonlines`
* `punctuation`, `whitespace` from `string`
* `contractions`
* `stopwords` from `nltk.corpus`
___

# Files in Python
Working with files is an essential part of Python programming. When we execute code in Python, we manipulate data through the use of variables. When the program is closed, however, any data stored in those variables is erased. To save the information stored in variables, we must learn how to write it to a file.

At the same time, we may have notebooks for applying specific analyses, but we need to have a way to bring data into the notebook for analysis. Otherwise, we would have to type all the data into the program ourselves! Both reading data in from files and writing data out to files are important skills for data science and the digital humanities.

This section describes how to work with three kinds of common data files in Python:
* Plain Text Files (.txt)
* Comma-Separated Value files (.csv)
* JavaScript Object Notation files (.json)

Each of these file types is in wide use in data science, digital humanities, and general programming.

# Three Common Data File Types

## Plain Text Files (.txt)
A plain text file is one of the simplest kinds of computer files. Plain text files can be opened with a text editor like Notepad (Windows 10) or TextEdit (OS X). The file can contain only basic textual characters such as letters, numbers, spaces, and line breaks. Plain text files do not contain styling such as heading sizes, italic, bold, or specialized fonts. (To include styling in a text file, writers may use other file formats such as rich text format (.rtf) or markdown (.md).)

Plain text files (.txt) can be easily viewed and modified by humans by changing the text within. This is an important distinction from binary files such as images (.jpg), archives (.gz), audio (.wav), or video (.mp4). If a binary file is opened with a text editor, the content will be largely unreadable.

## Comma-Separated Value Files (.csv)
A comma-separated value file is also a text file that can easily be modified with a text editor. A CSV file is generally used to store data that fits in a series or table (like a list or spreadsheet). A spreadsheet application (like Microsoft Excel or Google Sheets) will allow you to view and edit CSV data in the form of a table.

Each row of a CSV file represents a single row of a table. The values in a CSV are separated by commas (with no space between), but other delimiters can be chosen such as a tab or pipe (|). A tab-separated value file is called a TSV file (.tsv). Using tabs or pipes may be preferable if the data being stored contains commas (since this could make it confusing whether a comma is part of a single entry or a delimiter between entries).
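
To illustrate why the choice of delimiter matters, here is a minimal sketch that parses pipe-delimited text with Python's built-in `csv` module. The sample rows are invented for illustration. Because the values are separated by pipes, the commas inside the names are treated as ordinary text.

```python
import csv
import io

# Hypothetical pipe-delimited data; note the commas inside the "Name" values
pipe_text = "Name|City\nBooker, Rachel|Detroit\nGrey, Laura|Pittsburgh\n"

# io.StringIO lets us treat a string like an open file
reader = csv.reader(io.StringIO(pipe_text), delimiter='|')
rows = list(reader)

for row in rows:
    print(row)
```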

### The text contents of a sample CSV file
```
Username,Login email,Identifier,First name,Last name
booker12,rachel@example.com,9012,Rachel,Booker
grey07,,2070,Laura,Grey
johnson81,,4081,Craig,Johnson
jenkins46,mary@example.com,9346,Mary,Jenkins
smith79,jamie@example.com,5079,Jamie,Smith
```
### The same CSV file represented in Google Sheets:

![CSV table view in Google Sheets](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/csv_in_sheets.png)

## JavaScript Object Notation (.json)
A JavaScript Object Notation file is also a text file that can be modified with a text editor. A JSON file stores data in key/value pairs, very similar to a Python dictionary. One of the key benefits of JSON is its compactness, which makes it ideal for exchanging data between web browsers and servers.

While smaller JSON files can be opened in a text editor, larger files can be difficult to read. Viewing and editing JSON is easier in specialized editors, available online at sites like: 

* [JSON Formatter](http://jsonformatter.org)
* [JSON Editor Online](https://jsoneditoronline.org/)

A JSON file has a nested structure, where smaller concepts are grouped under larger ones. Like extensible markup language (.xml), a JSON file can be checked to determine that it is well-formed (follows the proper syntax for JSON) and/or valid (follows the structure defined in a specialized description, called a schema).
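
Whether a piece of text follows JSON syntax can be checked in Python by attempting to parse it. The sketch below uses the built-in `json` module; the helper name `parses_as_json` and the sample strings are made up for illustration. Checking a file against a schema requires a separate tool and is not shown here.

```python
import json

def parses_as_json(text):
    """Return True if `text` can be parsed as JSON, otherwise False."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

good = '{"firstName": "Julia", "age": 57}'
bad = "{'firstName': 'Julia', 'age': 57,}"  # single quotes and a trailing comma

print(parses_as_json(good))
print(parses_as_json(bad))
```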

### The text contents of a sample JSON file

```
{
    "firstName": "Julia",
    "lastName": "Smith",
    "gender": "woman",
    "age": 57,
    "address": {
        "streetAddress": "11434",
        "city": "Detroit",
        "state": "Mi",
        "postalCode": "48202"
    },
    "phoneNumbers": [
        { "type": "home", "number": "7383627627" }
    ]
}
```
### The same JSON file represented in JSON Editor Online
![An image of the JSON file showing the structure](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/json_editor.png)


# Opening, Reading, and Writing Text Files (.txt)

## Open the File

Before we can read or write to a text file, we must open the file. Normally, when we open a file, a new window appears where we can see the contents. In Python, opening a file simply means creating a *file object* that refers to the particular file. When the file has been opened, we can read or write to the file. Finally, we must close the file. Here are the three steps:

1. Use the `open()` function to create a file object
2. Use the `.read()`, `.readlines()`, or `.write()` method on the file object
3. Use the `.close()` method to close the file object

Let's practice on `sample.txt`, a sample text file.

In [None]:
# Open the text file `sample.txt` creating
# a file object called `f`

f = open('data/sample.txt', 'r')

We have created a file object called `f`. The first argument (`'data/sample.txt'`) is a string containing the path to the file: `sample.txt` is stored in the `data` folder next to this lesson. If your file was called reports.txt and stored in the same folder, you would replace that argument with `'data/reports.txt'`. The second argument (`'r'`) determines that we are opening the file in "read" mode, where the file can be read but not modified. There are three main modes that can be specified:

|Argument|Mode Name|Description|
|---|---|---|
|'r'|read|Reads file but no writing allowed (protects file from modification)|
|'w'|write|Writes to file, overwriting any information in the file (saves over the current version of the file)|
|'a'|append|Appends (or adds) information to the end of the file (new information is added below old information)|

## Read the File

### .read() method
Now that we have a file object `f` opened in "read mode," let's read the contents with the `.read()` method. We will create a variable called `file_contents` to hold the data that we are reading in.

In [None]:
# Create a variable called `file_contents`
# that will hold the result of using the
# .read() method on our file object
file_contents = f.read()
print(file_contents)

When we are finished with the file object, we must close it by calling its `.close()` method. It is very important to always close a file: an unclosed file keeps using system resources, and data written to it may not be fully saved until it is closed.

In [None]:
# Close the file by using the .close() method
# on the file object
f.close()

### .readlines() method

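If we want to work with a file line by line, we can use the `.readlines()` method instead of the `.read()` method. Note that `.readlines()` still reads the whole file into memory; it simply splits the contents into separate lines. For a file too large to fit in memory, we can loop over the file object itself (`for line in f:`), which reads a single line at a time.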

In [None]:
# Open the file sample.txt in read mode
# creating the file object `f`
f = open('data/sample.txt', 'r')
file_contents = f.readlines()
print(file_contents)

With the `.read()` method, we read in the whole text file as a single string. The `.readlines()` method gives us a Python list, where each item is a single string representing a single line of the text document. You may also notice that each line ends with `\n`, which represents a line break in a string. If we print the string, the `\n` appears in our output as an actual line break.

In [None]:
# Print the first item in the file_contents list
# Note the \n turns into a line break
print(file_contents[0])
f.close()
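
To process one line at a time without loading the whole file, one option is to loop over the file object directly. The sketch below first creates a small temporary file (the file name and contents are hypothetical) so that it can run on its own.

```python
import os
import tempfile

# Create a small hypothetical file to read back
path = os.path.join(tempfile.mkdtemp(), 'lines_demo.txt')
f = open(path, 'w')
f.write('line one\nline two\nline three\n')
f.close()

# Looping over the file object yields one line at a time
f = open(path, 'r')
line_lengths = []
for line in f:
    clean_line = line.rstrip('\n')  # remove the trailing line break
    line_lengths.append(len(clean_line))
    print(clean_line)
f.close()
```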

## Write to the File
To write to a file, we need to open it and create our file object in either write ('w') or append ('a') mode. 

### Append mode
Let's start with append mode which adds new data to the bottom of the file while leaving any previous information intact.

In [None]:
# Opening a file in append mode
# and creating a new file object 
# called `f`
f = open('data/sample.txt', 'a')

Now we can use the `.write()` method to append a string to the file. 

In [None]:
# Appending an eleventh line to the file
f.write('\nThis is the eleventh line')
f.close()

Can you read the file back in to see whether the `.write()` was successful?

In [None]:
# Open the file in read mode
# create a file object called `f`
f = 

# Use the .read() method on the file object
# Store the result in a variable `file_contents`
file_contents = 

# Print the contents
print(file_contents)

# Close the file
f.close()

### Write mode
Opening a file in write mode is useful in two scenarios:
* Creating a new text file and writing data to it
* Overwriting all data in the file with new data

Here is an example:

In [None]:
# Creating a new file in write mode
f = open('data/new_sample.txt', 'w')

# Define a string variable to add to the new file
string = 'Here is some data\nWith a second line'

# Using the write method on the file object
# (.write() returns the number of characters written, so we don't store it)
f.write(string)

# Close file object
f.close()
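
The difference between write mode and append mode can be seen side by side in a short sketch. It works in a temporary folder with a hypothetical file name, so none of the lesson files are changed.

```python
import os
import tempfile

# Work in a temporary folder so no real files are touched
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

f = open(path, 'w')      # 'w' creates the file (or erases an existing one)
f.write('first line')
f.close()

f = open(path, 'a')      # 'a' keeps existing content and adds to the end
f.write('\nsecond line')
f.close()

f = open(path, 'r')      # 'r' only allows reading
after_append = f.read()
f.close()

f = open(path, 'w')      # opening in 'w' again wipes the earlier content
f.write('replaced')
f.close()

f = open(path, 'r')
after_write = f.read()
f.close()

print(after_append)
print(after_write)
```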

## Open/Close Files `with open`
The `with open` technique is commonly used in Python because it has two significant advantages:
* It is more compact 
* It automatically closes the file afterward

The basic form resembles a flow control statement, ending in a colon and then executing an indented block of code. After the block executes, the file is closed automatically.

In [None]:
with open('data/sample.txt', 'r') as f:
    print(f.read())
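
The same technique works for writing. As a small sketch (again using a temporary, hypothetical file), we can confirm that the file is closed after the block by checking the file object's `closed` attribute.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')  # hypothetical file

# Write inside a `with` block; no explicit .close() is needed
with open(path, 'w') as f:
    f.write('Saved inside a with block')

# The `closed` attribute reports whether the file object is closed
print(f.closed)

with open(path, 'r') as f:
    text = f.read()
print(text)
```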

# Opening, Reading, and Writing CSV Files (.csv)
CSV file data can be easily opened, read, and written using the `pandas` library. Pandas is flexible for working with tabular data, and the process for importing and exporting to CSV is simple. For large CSV files (over 500 MB), you may wish to use the `csv` library to read in a single row at a time to reduce the memory footprint.

In [None]:
# Import pandas 
import pandas as pd

# Create our dataframe
df = pd.read_csv('data/sample.csv')

In [None]:
# Display the dataframe
print(df)

After you've made any necessary changes in Pandas, write the dataframe back to the CSV file. (Remember to always back up your data before writing over the file.)

In [None]:
# Write data to new file
# Keeping the Header but removing the index
df.to_csv('data/new_sample.csv', header=True, index=False)
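
For comparison, here is a minimal sketch of the row-at-a-time approach with the built-in `csv` library. The tiny file it creates stands in for a large one, and its name and contents are hypothetical.

```python
import csv
import os
import tempfile

# Build a small hypothetical CSV file to stand in for a large one
path = os.path.join(tempfile.mkdtemp(), 'users_demo.csv')
with open(path, 'w', newline='') as f:
    f.write('Username,Identifier\nbooker12,9012\ngrey07,2070\nsmith79,5079\n')

# csv.DictReader yields one row at a time as a dictionary
large_ids = []
with open(path, 'r', newline='') as f:
    for row in csv.DictReader(f):
        # Values are read as strings, so convert before comparing
        if int(row['Identifier']) > 5000:
            large_ids.append(row['Username'])

print(large_ids)
```

Only the current row is held in memory inside the loop, which is what keeps the footprint small.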

# Opening, Reading, and Writing JSON Files (.json)

JSON files use a key/value structure very similar to a Python dictionary. We will start with a Python dictionary called `py_dict` and then write the data to a JSON file using the `json` library.

In [None]:
# Defining sample data in a Python dictionary
py_dict = {
    "firstName": "Julia",
    "lastName": "Smith",
    "gender": "woman",
    "age": 57,
    "address": {
        "streetAddress": "11434",
        "city": "Detroit",
        "state": "Mi",
        "postalCode": "48202"
    },
    "phoneNumbers": [
        { "type": "home", "number": "7383627627" }
    ]
}

To write our dictionary to a JSON file, we will use the `with open` technique we learned that automatically closes file objects. We also need the `json` library to help dump our dictionary data into the file object. The `json.dump` function works a little differently than the write method we saw with text files. 

We need to specify two arguments: 

* The data to be dumped
* The file object where we are dumping

In [None]:
# Open/create sample.json in write mode
# as the file object `f`. The data in py_dict
# is dumped into `f` and then `f` is closed

import json
with open('data/sample.json', 'w') as f:
    json.dump(py_dict, f)

To read data in from a JSON file, we can use the `json.load` function on our file object. Here we load all the content into a variable called `contents`. We can then print values based on particular keys.

In [None]:
with open('data/sample.json') as f:
    contents = json.load(f)
    print('First Name: ' + contents['firstName'])
    print('Last Name: ' + contents['lastName'])
    print('Age: ' + str(contents['age']))
    print('Phone Number: ', contents['phoneNumbers'][0]['number'])
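
The `json` library also offers `json.dumps` and `json.loads`, which work with strings instead of file objects. The sketch below uses a shortened, hypothetical record to show how the `indent` argument produces more readable output and that the data survives a round trip.

```python
import json

# A shortened, hypothetical record for illustration
record = {"firstName": "Julia", "phoneNumbers": [{"type": "home", "number": "7383627627"}]}

# json.dumps returns a string instead of writing to a file;
# indent=4 puts each nested level on its own indented line
pretty = json.dumps(record, indent=4)
print(pretty)

# json.loads turns the string back into Python objects
restored = json.loads(pretty)
print(restored == record)
```

The `indent` argument is also accepted by `json.dump`, which can make a saved file easier to read in a text editor.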

# Opening datasets with `tdm_client`

The `tdm_client` helps retrieve a given dataset and/or its associated metadata. The metadata is supplied in a CSV file and the full dataset is supplied in a compressed JSON Lines file (.jsonl.gz). For any analysis focused on metadata pre-processing, we recommend users start with the CSV file since it is both easier and faster to view, parse, and manipulate.


## Metadata CSV vs. JSON Lines Data File
All of the textual data and metadata is available inside of the JSON Lines files, but we have chosen to offer the metadata CSV for two primary reasons:

1. The JSON Lines data is a little more complex to parse since it is nested. It cannot be easily represented in a table form in something like Pandas. It is nice to be able to easily view all the metadata in a Pandas dataframe.
2. The JSON Lines data can be very large. Each file contains all of the metadata *plus* unigram counts, bigram counts, trigram counts, and full-text (when available). Manipulating all that data takes significant computer time and costs. Even a modest dataset (~5000 files) can be over 1 GB in size uncompressed.

More information is available, including the metadata categories, in the FAQ ["What is the data file format?"](https://docs.tdm-pilot.org/what-format-are-jstor-portico-datasets/). 

## Retrieving data (`tdm_client` methods)

By passing the `tdm_client` a dataset ID (here called `dataset_id`), we can automatically download the metadata CSV file or the full JSON Lines dataset created by the [dataset builder](https://tdm-pilot.org/builder/).

* Use the `get_metadata()` method to retrieve the metadata CSV file
* Use the `get_dataset()` method to retrieve the full JSON Lines data file

|Code|Result|
|---|---|
|f = tdm_client.get_metadata(dataset_id)|Automatically retrieves a metadata CSV file and creates a file object `f`|
|f = tdm_client.get_dataset(dataset_id)| Automatically retrieves a compressed JSON Lines dataset file (jsonl.gz) and creates a file object `f`|

The JSON Lines file will be downloaded in a compressed gzip format (jsonl.gz). We can iterate over each document in the corpus by using the `dataset_reader()` method.

## Import the `tdm_client` methods

We'll import each of the methods individually so that we don't have to include `tdm_client` each time we call them.

In [None]:
# Import functions from the `tdm_client`
from tdm_client import get_dataset, get_metadata, dataset_reader

## Import your dataset

First, we'll use the `tdm_client` library to automatically retrieve the [metadata](https://docs.tdm-pilot.org/key-terms/#metadata) for a [dataset](https://docs.tdm-pilot.org/key-terms/#dataset). We can retrieve [metadata](https://docs.tdm-pilot.org/key-terms/#metadata) in a [CSV file](https://docs.tdm-pilot.org/key-terms/#csv-file) using the `get_metadata` method.

Enter a [dataset ID](https://docs.tdm-pilot.org/key-terms/#dataset-ID) in the next code cell. 

If you don't have a dataset ID, you can:
* Use the sample dataset ID already in the code cell
* [Create a new dataset](https://tdm-pilot.org/builder)
* [Use a dataset ID from other pre-built sample datasets](https://tdm-pilot.org/dataset/dashboard)

In [None]:
# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is 
dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"

Next pass the `dataset_id` as an argument using the `get_metadata` method.

In [None]:
# Pull in our dataset CSV
dataset_metadata = get_metadata(dataset_id)

Now we're ready to import pandas for our analysis and create a dataframe. We will use the `read_csv()` method to create our dataframe from the CSV file.

In [None]:
# Import pandas 
import pandas as pd

# Create our dataframe
df = pd.read_csv(dataset_metadata)

We can confirm the size of our dataset using the `len()` function on our dataframe.

In [None]:
original_document_count = len(df)
print('Total original documents:', original_document_count)

Now let's take a look at the data in our dataframe `df`. We will set pandas to show all columns using `set_option()` then get a preview using `head()`.

In [None]:
# Set the pandas option to show all columns
pd.set_option("display.max_columns", None) 

# Show the first five rows of our dataframe
df.head() 

---
## Metadata Type by Column Name

Here are descriptions for the metadata types found in each column:

|Column Name|Description|
|---|---|
|id|a unique item ID (In JSTOR, this is a stable URL)|
|title|the title for the item|
|isPartOf|the larger work that holds this title (for example, a journal title)|
|publicationYear|the year of publication|
|doi|the digital object identifier for an item|
|docType|the type of document (for example, article or book)|
|provider|the source or provider of the dataset|
|datePublished|the publication date in yyyy-mm-dd format|
|issueNumber|the issue number for a journal publication|
|volumeNumber|the volume number for a journal publication|
|url|a URL for the item and/or the item's metadata|
|creator|the author or authors of the item|
|publisher|the publisher for the item|
|language|the language or languages of the item (eng is the ISO 639 code for English)|
|pageStart|the first page number of the print version|
|pageEnd|the last page number of the print version|
|placeOfPublication|the city of the publisher|
|wordCount|the number of words in the item|
|pageCount|the number of print pages in the item|
|outputFormat|what data is available ([unigrams](https://docs.tdm-pilot.org/key-terms/#unigram), [bigrams](https://docs.tdm-pilot.org/key-terms/#bigram), [trigrams](https://docs.tdm-pilot.org/key-terms/#trigram), and/or full-text)|

If there are any columns you would like to drop from your analysis, you can drop them with:

`df = df.drop(['column_name1', 'column_name2', ...], axis=1)`

In [None]:
# Drop each of these named columns
df = df.drop(['outputFormat', 'pageEnd', 'pageStart', 'datePublished', 'language'], axis=1)

# Show the first five rows of our updated dataframe
df.head()

If you would like to know if a particular id is in the dataframe, you can use the `in` operator to return a boolean value (True or False). 

In [None]:
# Check if a particular item id is in the `id` column
'http://www.jstor.org/stable/2868641' in df.id.values

## Filtering Out Unwanted Texts

Now that we have filtered out unwanted metadata columns, we can begin filtering out any texts that may not match our research interests. Let's examine the first and last twenty rows of the dataframe to see if we can identify texts that we would like to remove. We are looking for patterns in the metadata that could help us remove many texts at once.

In [None]:
# Preview the first twenty items in the dataframe
# df.head(20) # Change 20 to view a greater or lesser number of rows

In [None]:
# Preview the last twenty items in the dataframe
# df.tail(20) # Change 20 to view a greater or lesser number of rows

### Remove all rows without data for a particular column

For example, we may wish to remove any texts that do not have authors. (In the case of journals, this may be helpful for removing paratextual sections such as the table of contents, indices, etc.) The column of interest in this case is `creator`. 

In [None]:
# Remove all texts without an author
df = df.dropna(subset=['creator']) # Drop each row that has no value under 'creator'

In [None]:
# Print the total original documents followed by the current number
print('Total original documents:', original_document_count)
print('Total current documents: ', len(df))

### Remove rows based on the content of a particular column

We can also remove texts that have a particular value in a column. Here are a few examples.

In [None]:
# Remove all items with a particular title
df = df[df.title != 'Review Article'] # Change `Review Article` to your desired title

In [None]:
# Remove all items with 3000 words or fewer
df = df[df.wordCount > 3000] # Change `3000` to your desired number

In [None]:
# Print the total original documents followed by the current number
print('Total original documents:', original_document_count)
print('Total current documents: ', len(df))
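
The filters in this section can also be combined into a single step. Here is a minimal sketch on a small, made-up dataframe (the rows are hypothetical; the column names follow the metadata table above), where each condition is stored in its own variable before being applied.

```python
import pandas as pd

# Hypothetical metadata rows for illustration
toy_df = pd.DataFrame({
    'title': ['Review Article', 'On Hamlet', 'Front Matter', 'Macbeth Notes'],
    'creator': ['A. Smith', 'B. Jones', None, 'C. Lee'],
    'wordCount': [5000, 8000, 200, 1500],
})

# Each condition produces a True/False value for every row
has_author = toy_df['creator'].notna()
long_enough = toy_df['wordCount'] > 3000
not_review = toy_df['title'] != 'Review Article'

# `&` keeps only the rows where all conditions are True
filtered = toy_df[has_author & long_enough & not_review]
print(filtered)
```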

Take a final look at your dataframe to make sure the current texts fit your research goals. In the next step, we will save the IDs of your pre-processed dataset.

In [None]:
# Preview the first 50 rows of your dataset
df.head(50)

## Saving a list of IDs to a CSV file

In [None]:
# Write the column "id" to a CSV file called `pre-processed_###.csv` where ### is the `dataset_id`
df["id"].to_csv('data/pre-processed_' + dataset_id + '.csv')

Download the "pre-processed_###.csv" file (where ### is the `dataset_id`) for future analysis. You can use this file in combination with the dataset ID to automatically filter your texts and reduce the processing time of your analyses.

---
## Visualizing the Pre-Processed Data

In [None]:
# Group the data by publication year and plot the number of ids per year as a bar chart
df.groupby(['publicationYear'])['id'].agg('count').plot.bar(title='Documents by year', figsize=(20, 5), fontsize=12); 

# Read more about Pandas dataframe plotting here: 
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html

And now let's look at the total number of pages by year.

In [None]:
# Group the data by publication year and plot the summed page counts as a bar chart

df.groupby(['publicationYear'])['pageCount'].agg('sum').plot.bar(title='Pages by year', figsize=(20, 5), fontsize=12);

## Reading in data from a dataset object

The `dataset_reader()` method will read in data from the compressed JSON Lines dataset file object. By keeping the data in a compressed format and reading in a single line at a time, we reduce the processing time and memory use. These can be substantial for large datasets. Even a modest dataset (~5000 files) can be over 1 GB in size uncompressed.

The `dataset_reader()` essentially unzips the dataset one row at a time. Each row contains all the metadata and data available for a single document. Here is what that looks like in the actual `tdm_client`:

```
with gzip.open(file_path, "rb") as input_file:
    for row in input_file:
        yield json.loads(row)
```

In most cases, users will want to iterate over every document/row in the JSON Lines file. In practice, that looks like:

|Code|Result|
|---|---|
|for document in tdm_client.dataset_reader(f):| Iterates over each document in file object `f`|
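
To see this pattern run end to end without downloading a dataset, here is a self-contained sketch. The function `read_jsonl_gz` is a hypothetical stand-in written for this example (it is not the `tdm_client` function), and the two documents are invented.

```python
import gzip
import json
import os
import tempfile

def read_jsonl_gz(file_path):
    """Yield one parsed document per line of a gzip-compressed JSON Lines file."""
    with gzip.open(file_path, 'rb') as input_file:
        for row in input_file:
            yield json.loads(row)

# Write a tiny hypothetical dataset: one JSON object per line
path = os.path.join(tempfile.mkdtemp(), 'demo.jsonl.gz')
documents = [
    {"id": "doc-1", "unigramCount": {"chicken": 2}},
    {"id": "doc-2", "unigramCount": {"egg": 1}},
]
with gzip.open(path, 'wt', encoding='utf-8') as output_file:
    for document in documents:
        output_file.write(json.dumps(document) + '\n')

# Only one document is held in memory at a time while looping
ids = [document['id'] for document in read_jsonl_gz(path)]
print(ids)
```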


In [None]:
# Pull in our dataset JSON Lines file
f = get_dataset(dataset_id)

## Working with unigram counts in the dataset file

The most significant data for text analysis is usually the "unigramCount" section where the frequency of each word is recorded. In this context, the word "unigram" describes a single word construction like the word "chicken." 


In this section we will:
- pre-process unigrams
- create, explore, and clean a dataframe with unigram counts
- export the dataframe to CSV

### Import libraries for working with dataset files

We'll be using several packages for keeping track of execution time, working with JSON Lines files, and text pre-processing. 

In [None]:
# Import datetime
import datetime

# Import jsonlines 
import jsonlines

# Import packages for text pre-processing 
from string import punctuation, whitespace
import contractions
from nltk.corpus import stopwords

### Load `preprocess_text` function

To simplify the code, we'll go ahead and define the function below to pre-process the unigrams we'll be working with.

In [None]:
def preprocess_text(text):
    # Remove leading/trailing whitespace and punctuation
    text = text.strip(whitespace + punctuation)
    
    # Lowercase all characters
    text = text.lower()
    
    # Expand contractions, split at the new space, and remove auxiliary
    # You can comment out or delete this step if you want to retain contractions.
    text = contractions.fix(text)
    text = text.split(" ", 1)
    text = text[0]
    
    # Return the pre-processed text
    return text
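
To see what the first two steps of the function do, here is a small sketch that applies only the stripping and lowercasing to a few invented tokens. It leaves out the `contractions` step so that it runs with the standard library alone.

```python
from string import punctuation, whitespace

# Hypothetical raw tokens like those found in unigram counts
raw_tokens = ['"Hamlet,', 'THE', ' night. ', '(1603)', '--']

cleaned = []
for token in raw_tokens:
    # strip() removes the listed characters from both ends only
    token = token.strip(whitespace + punctuation)
    cleaned.append(token.lower())

print(cleaned)
```

Note that a token made up entirely of punctuation becomes an empty string, which is why empty values can show up among the unigrams after pre-processing.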

### Creating a new dataset file with pre-processed unigrams

Now we can pre-process the unigrams and create a new dataset file containing the metadata and updated unigram counts. We won't be working with bigrams and trigrams in this workshop, so we will leave those out of the new file, which allows us to work with a smaller file (less time and computational resources required).

In [None]:
# Start timer and set counter for progress tracking
start_time = datetime.datetime.now()
counter = 0

# Assign the output file
output_file = 'data/%s_%s.jsonl' % (dataset_id, 'unigrams')


# Open dataset file for writing
with jsonlines.open(output_file, mode='w') as writer:
    
    # Iterate through documents/lines in JSON Lines file
    for document in dataset_reader(f):
        new_document = {}
        
        # Iterate through keys in current document
        for key in document.keys():
            
            # Process unigrams and add/update unigram counts
            if key == 'unigramCount':
                unigramCount = document[key]
                new_unigramCount = {}
                
                # Iterate through unigram counts
                for unigram in unigramCount.keys():
                    
                    # Pre-process unigram
                    new_unigram = preprocess_text(unigram)
                    
                    # Update existing unigram in dictionary
                    if new_unigram in new_unigramCount:
                        new_unigramCount[new_unigram] += unigramCount[unigram]
                    
                    # Add unigram to dictionary
                    else:
                        new_unigramCount[new_unigram] = unigramCount[unigram]
                
                # Add unigram count for new document metadata
                new_document[key] = new_unigramCount
            
            # Skip bigramCount, trigramCount, and fulltext keys
            elif key in ['bigramCount', 'trigramCount', 'fullText', 'fulltext']:
                continue
            
            # Add all other fields to new document metadata
            else:
                new_document[key] = document[key]
        
        # Write document metadata to JSON Lines file
        writer.write(new_document)
        
        # Report count of processed documents
        counter += 1
        print("Documents processed: %s" % counter, end="\r")
        
        # Stop after 2,500 documents
        # (raise or remove this cap to process the whole dataset)
        if counter == 2500:
            break

# End timer and print execution time
end_time = datetime.datetime.now()
print("Execution Time: " + str(end_time - start_time))
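
The count-merging step in the cell above (add to an existing key, otherwise create it) can also be written with `collections.Counter` from the standard library. This is an alternative sketch rather than the approach used in this notebook; the sample counts are invented and `normalize` is a simplified stand-in for `preprocess_text`.

```python
from collections import Counter

# Hypothetical raw counts where several keys normalize to the same word
raw_counts = {'The': 3, 'the': 10, 'the,': 2, 'Night': 1}

def normalize(token):
    """A simplified stand-in for `preprocess_text`: trim and lowercase."""
    return token.strip(' ,.').lower()

merged = Counter()
for token, count in raw_counts.items():
    # Counter treats missing keys as 0, so no if/else is needed
    merged[normalize(token)] += count

print(dict(merged))
```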

### Creating a dataframe with unigram counts

Next we'll create a dataframe with the ids and unigram counts. This will allow us to more easily explore and visualize the count data. 

In [None]:
# Start timer and set counter for progress tracking
start_time = datetime.datetime.now()
counter = 0

# Create a list to hold each row of document data 
document_data = []

# Open output file for reading in document data, line-by-line
with jsonlines.open(output_file) as reader:
    for document in reader:
        
        # Iterate through keys in dictionary containing document data
        for key in document.keys():
            
            # Save id
            if key == 'id':
                unique_id = document[key]
            
            # Iterate through unigramCount dictionary and save unigram and count
            elif key == 'unigramCount':
                unigramCount = document[key]
                for unigram in unigramCount:
                    count = unigramCount[unigram]
                    
                    # Add id, unigram, and count to a row in document data list
                    row = (unique_id, unigram, count)
                    document_data.append(row)
    
        # Report count of processed documents
        counter += 1
        print("Documents processed: %s" % counter, end="\r")

# Create document dataframe
print("\nCreating dataframe...")
document_df = pd.DataFrame(document_data, columns=['id', 'unigram', 'count'])

Let's see how many rows we have and take a look at the first 20 rows of the dataset.

In [None]:
# Display the number of rows in the dataframe
print(len(document_df))

# Display the first 20 rows of dataframe
document_df.head(20)

Looking at the dataframe, you may want to sort the rows so that the most frequent words are at the top and descend from there. Let's go ahead and do that, then check the first 20 rows again. 

In [None]:
# Sort dataframe values by count, in descending order
# If you want to sort in ascending order, you can change the Boolean value of `ascending` to True
document_df_sorted = document_df.sort_values(by=['count'], ascending=False)

# Display the first 20 rows of dataframe
document_df_sorted.head(20)

### Removing empty values and stopwords from unigram counts

You'll probably notice that the most common unigrams are empty values and stopwords (commonly used and auxiliary words, such as “the”, “a”, “an”, “in”, “she”). You might want to get rid of those. 

This step may take a bit longer than the others because it's a bit resource-consuming to check if each stopword exists in the dataframe and then remove all rows containing them.

In [None]:
# Start timer, get length of stopword list, and set counter for progress tracking
start_time = datetime.datetime.now()
stopwords_count = len(stopwords.words('english'))
counter = 0

# Remove all rows where unigram is an empty value
document_df_sorted.drop(document_df_sorted[ (document_df_sorted['unigram'] == '')].index, inplace=True)

# Iterate through stopwords list and remove rows where unigram is a stopword
for stopword in stopwords.words('english'):
    document_df_sorted.drop(document_df_sorted[ (document_df_sorted['unigram'] == stopword)].index, inplace=True)
    
    # Increment count and display progress report
    counter += 1
    print("Progress: %6.2f%%" % ((counter/stopwords_count)*100), end='\r')

# Notify that program has completed successfully
print("\nComplete!")

# Stop timer, then calculate and display program execution time
end_time = datetime.datetime.now()
print("Execution Time: " + str(end_time - start_time))  
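
If the loop above feels slow on a large dataframe, one alternative worth trying is the pandas `isin()` method, which checks every row against the whole list in a single pass. The sketch below uses an invented dataframe and a tiny hypothetical stopword set rather than the NLTK list.

```python
import pandas as pd

# Hypothetical unigram counts for illustration
toy_counts = pd.DataFrame({
    'id': ['doc-1', 'doc-1', 'doc-1', 'doc-2'],
    'unigram': ['the', 'hamlet', '', 'a'],
    'count': [40, 7, 3, 12],
})

# A tiny hypothetical stopword set; the empty string is included so that
# empty unigrams are removed in the same pass
unwanted = {'the', 'a', 'an', 'in', ''}

# isin() checks every row against the whole set at once;
# `~` inverts the result so we keep rows that are NOT unwanted
kept = toy_counts[~toy_counts['unigram'].isin(unwanted)]
print(kept)
```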

Let's check the first 20 rows of the dataframe one last time.

In [None]:
document_df_sorted.head(20)

You might also want to remove numbers, certain characters, etc. But we won't worry about that now. 

### Saving your unigram count data to a CSV file

You'll probably want to save your count data to a CSV file to work with later. Let's do that now.

In [None]:
# Start timer
start_time = datetime.datetime.now()

# Write document dataframe to a CSV file
# (use `document_df_sorted` instead to save the version without stopwords)
print("Writing CSV file...")
document_df.to_csv('data/%s_unigrams.csv' % dataset_id, index=False, encoding='utf-8')

# Notify that program has completed successfully
print("Complete!")

# Stop timer, then calculate and display program execution time
end_time = datetime.datetime.now()
print("Execution Time: " + str(end_time - start_time))

## Next Steps

At this point, you could create a dataframe for unigram counts across the whole dataset by dropping the ids and aggregating the counts. Then you could also use what we learned earlier in this notebook to visualize the counts of the top unigrams in the corpus. If you're interested in them, you could apply what we did with unigrams to the bigrams and/or trigrams. The world is your oyster! 
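
As a starting point, the aggregation described above might look something like the following sketch. The dataframe here is invented for illustration; with real data you would start from your own unigram count dataframe.

```python
import pandas as pd

# Hypothetical per-document unigram counts
toy_counts = pd.DataFrame({
    'id': ['doc-1', 'doc-1', 'doc-2', 'doc-2'],
    'unigram': ['hamlet', 'ghost', 'hamlet', 'king'],
    'count': [7, 2, 5, 4],
})

# Sum the counts for each unigram across all documents,
# then sort so the most frequent unigrams come first
totals = (
    toy_counts.groupby('unigram')['count']
    .sum()
    .sort_values(ascending=False)
)
print(totals.head(10))
```

From there, calling `.plot.bar()` on the top rows would produce a bar chart like the ones earlier in this notebook.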

If you need any help, feel free to [email me](mailto:tyt3@pitt.edu) or use the following support resources:
- [Attend Office Hours for Beta Participants](https://docs.tdm-pilot.org/office-hours-for-beta-participants/)
- [Email the Constellate Team](mailto:tdm@ithaka.org)
- [Join the Email Group](https://ithaka.groups.io/g/tdm-jstor-portico)
- Join the Constellate Slack Channel ([email me](mailto:tyt3@pitt.edu))