<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under the [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Firstname Lastname](https://) for the 2024 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).

For questions/comments/improvements, email author@email.address.<br />
____

# `Course Title` `1/2/3`

This is lesson `1` of 3 in the educational series on `TOPIC`. This notebook is intended `to teach XXX and introduce the concepts of XXXX`. 

**Skills:** 
* Data analysis
* Machine learning
* Text analysis
* Language models
* Retrieval Augmented Generation
* Text classification
* spaCy
* Vector databases
* Semantic search
* R
* Python

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial` / `How-To` / `Reference` / `Explanation` 

`Include the use case definition from [here](https://constellate.org/docs/documentation-categories)`

**Difficulty:** `Beginner` / `Intermediate` / `Advanced`

`Beginner assumes users are relatively new to Python and Jupyter Notebooks. The user is helped step-by-step with lots of explanatory text.`
`Intermediate assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.`
`Advanced assumes users are very familiar with Python and have been programming for years, but they may not be familiar with the process being explained.`

**Completion time:** `90 minutes`

**Knowledge Required:** 
```
* Python basics (variables, flow control, functions, lists, dictionaries)
* Object-oriented programming (classes, instances, inheritance)
* Regular Expressions (`re`, character classes)

These should be general skills but can mention a particular library
```

**Knowledge Recommended:**
```
* Basic file operations (open, close, read, write)
* Data cleaning with `Pandas`
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Describe and implement an XXXX for XXXX
2. Convert XXXX into XXXX for the purpose of XXXX
3. Develop a workflow in order to XXXX
4. Be familiar with XXXXX resources for pursuing the topic
```
**Research Pipeline:**
```
1. Research steps before this notebook
2. **The skills in this notebook**
3. Steps after this notebook
4. Final steps
```
___

# Required Python Libraries
`List out any libraries used and what they are used for`
* [Tesseract](https://tesseract-ocr.github.io/) for performing [optical character recognition](https://docs.constellate.org/key-terms/#ocr).
* [Pandas](https://pandas.pydata.org/) for manipulating and cleaning data.
* [Pdf2image](https://pdf2image.readthedocs.io/en/latest/) for converting PDF files into image files.

## Install Required Libraries

In [None]:
### Install Libraries ###

# Using !pip installs
!pip install pdf2image

In [None]:
### Import Libraries ###
import urllib.request

# Required Data

`List out the data sources, including their formats and a few sentences describing the data. Include a link to the data source description, if possible.`

**Data Format:** 
* image files (.jpg, .png)
* document files (.pdf)
* plain text (.txt)

**Data Source:**
* [Detroit Open Data Portal](https://data.detroitmi.gov/datasets/detroitmi::dpd-citizen-complaints/about)

**Data Quality/Bias:**
`Analysis of this data should consider the following quality and bias issues...`

**Data Description:**

`This lesson uses XXXX data in XXX format from XXXX source. Additional details about the data used.`

## Download Required Data

In [None]:
### Grab files with console `wget` and `mv` ###
!wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
!mv eng.traineddata /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata


In [None]:
### Grab a single file and supply name ###
urllib.request.urlretrieve('https://file.address.txt', 'filename.txt')

In [None]:
### Retrieve multiple files using a list and string splitting ###

download_urls = [
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample_01.pdf',
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample_02.pdf',
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample_03.pdf'
]

for url in download_urls:
    urllib.request.urlretrieve(url, url.rsplit('/', 1)[-1])
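
The string split above keeps everything after the last `/` in each URL. As a minimal sketch, assuming a URL might also carry a query string or percent-encoded characters (the sample URLs above do not), the standard library can parse the path before taking the file name. The helper name `filename_from_url` is hypothetical and used only for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring any query string."""
    # urlsplit separates the path from the query (?...) and fragment (#...)
    path = urlsplit(url).path
    # unquote turns percent-escapes such as %20 back into plain characters
    return unquote(PurePosixPath(path).name)


sample_url = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample_01.pdf'
print(filename_from_url(sample_url))
```

For the sample URLs in this template, the result matches `url.rsplit('/', 1)[-1]`; the parsing step only matters for URLs with extra parts after the file name.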

In [None]:
### Retrieve multiple files using a list and Path ###
from pathlib import Path
import urllib.request

# Check if a folder exists to hold pdfs. If not, create it.
pdfs_folder = Path.cwd() / 'data' / 'sample_pdfs'
pdfs_folder.mkdir(parents=True, exist_ok=True)

# Define a list of URLs for our sample pdfs. 
download_urls = [
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample_01.pdf',
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample_02.pdf',
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample_03.pdf'
]

# For each URL, download the file into the folder, keeping its original name
for url in download_urls:
    file_name = Path(url).name
    urllib.request.urlretrieve(url, pdfs_folder / file_name)

# Success message
print('Folder created and pdfs added.')
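
Learners often re-run download cells. The following is a rough sketch, not part of the template itself, of one way to skip files that are already on disk. The function name `download_missing` and its `fetch` parameter are hypothetical; `fetch` defaults to `urllib.request.urlretrieve` and exists only so the logic can be tried without a network connection.

```python
from pathlib import Path
import urllib.request


def download_missing(urls, folder, fetch=urllib.request.urlretrieve):
    """Download each URL into folder unless a file with that name already exists."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for url in urls:
        destination = folder / url.rsplit('/', 1)[-1]
        if destination.exists():
            continue  # Already on disk from an earlier run
        fetch(url, destination)
        downloaded.append(destination)
    # Return only the files fetched during this call
    return downloaded
```

Calling `download_missing(download_urls, pdfs_folder)` a second time should return an empty list, assuming the first call succeeded and the files were not removed.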

In [None]:
### Constellate Example ###

# Importing your dataset with a dataset ID
import constellate
# Pull in the sampled dataset (1500 documents) that matches `dataset_id`
# in the form of a gzipped JSON lines file.
# The .get_dataset() method downloads the gzipped JSONL file
# to the /data folder and returns a string for the file name and location
dataset_file = constellate.get_dataset(dataset_id)

# To download the full dataset (up to a limit of 25,000 documents),
# request it first in the builder environment. See the Constellate Client
# documentation at: https://constellate.org/docs/constellate-client
# Then use the `constellate.download` method shown below.
#dataset_file = constellate.download(dataset_id, 'jsonl')
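
The comments above describe the dataset as a gzipped JSON Lines file: one JSON object per line, compressed with gzip. As a minimal sketch using only the standard library, the block below writes a tiny stand-in file and reads it back. The helper name `read_jsonl_gz` and the field names `id` and `title` are made up for illustration and are not taken from the Constellate schema.

```python
import gzip
import json
import tempfile
from pathlib import Path


def read_jsonl_gz(file_path):
    """Yield one dictionary per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(file_path, 'rt', encoding='utf-8') as lines:
        for line in lines:
            if line.strip():
                yield json.loads(line)


# Build a tiny stand-in file; these records are invented for the example
sample_records = [
    {'id': 'doc-1', 'title': 'First'},
    {'id': 'doc-2', 'title': 'Second'},
]
sample_path = Path(tempfile.mkdtemp()) / 'sample.jsonl.gz'
with gzip.open(sample_path, 'wt', encoding='utf-8') as out:
    for record in sample_records:
        out.write(json.dumps(record) + '\n')

# Read the file back, one document at a time
documents = list(read_jsonl_gz(sample_path))
print(len(documents), documents[0]['title'])
```

Because the reader is a generator, it holds one document in memory at a time, which may help with larger downloads. In a real lesson, the path returned in `dataset_file` would take the place of `sample_path`.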


# Introduction

```
Introduce the lesson topic. Answer questions such as:
* Why is it useful? 
* Why should we learn it? 
* Who might use it? 
* Where has it been used by scholars/industry?
* What do we need to do it?
* What subjects are included in the notebooks?
* What is not in this notebook? Where should we look for it?
```

# Lesson

## Style tips for writing your lesson body

### Creating your table of contents and sections
Break down your sections using markdown headings of different sizes (greater or fewer #s). Users will have access to a table of contents automatically generated by headings.

### Markdown cheatsheet

Here is a [quick cheatsheet](https://www.markdownguide.org/cheat-sheet/) of useful Markdown for Jupyter. When adding images, be sure to include an alternative description for accessibility.

### Data/Image/Video hosting

GitHub is not a good place to store images, data, or other large files for your lesson. Please store them somewhere else (Google Drive, Dropbox, Amazon S3, etc.).

### Creating screenshots

#### Mac
I recommend using command-shift-4 to draw a rectangle around the part of the screen you want to capture. I often use [Paintbrush](https://paintbrush.sourceforge.io/) for basic editing. For moving images (.gif), I record the screen with command-shift-5, then convert the video using [Gifski](https://github.com/sindresorhus/Gifski). Keep the frame rate low, because GIF file sizes grow very quickly.

#### Windows
I recommend capturing a screenshot using the print-screen key on your keyboard, then pasting into Microsoft Paint. At that point, it is easy to crop just the relevant section.

___
[Proceed to next lesson: Course Title 2/3 ->](./lesson-2.ipynb)

# Exercises (Optional)

`If possible, include practice exercises for users to do on their own. These may have clear solutions or be more open-ended.`

# Solutions (Optional)
`Offer some possible solutions for the practice exercises.`


# References (Optional)
Citations are not required, but include this section if you have cited academic sources. Use whatever format you like; just be consistent. Markdown footnotes are not well supported in notebooks.[$^{1}$](#1) I suggest using an anchor link with plain HTML as shown.[$^{2}$](#2)

1. <a id="1"></a> Here is an anchor link footnote.
2. <a id="2"></a> D'Ignazio, Catherine and Lauren F. Klein. [*Data Feminism*](https://mitpress.mit.edu/books/data-feminism). MIT Press, 2020.