# Module 0 Project 1: Data Collection

- Implement a data collection pipeline from a sample source
- The goal of this project is to learn the basic steps for finding a data source and loading it into our code for further analysis
- This project only collects the data and displays it; preprocessing comes in the next module

## STEP 1: DATA ACCESS
- Find a data source (Kaggle, Wikipedia, web scraping, etc.)
- Get access to the [data](https://www.gutenberg.org/cache/epub/2554/)
- Here we are using Crime and Punishment by Dostoevsky (from Project Gutenberg)

In [None]:
import requests
import os
import re
import pandas as pd

# Project Gutenberg: Crime and Punishment (plain-text file)
resp = requests.get("https://www.gutenberg.org/cache/epub/2554/pg2554.txt", timeout=30)
resp.raise_for_status()  # Stop here if the download failed

## STEP 2: INITIAL DATA COLLECTION
- Perform initial collection steps such as filtering and saving to a file
- Use a regex to extract the book text from the response; the body sits between the start and end markers below
- Write the extracted text to a local file

In [None]:
# Tags for text filtering
start = "*** START OF THE PROJECT GUTENBERG EBOOK CRIME AND PUNISHMENT ***"
end = "*** END OF THE PROJECT GUTENBERG EBOOK CRIME AND PUNISHMENT ***"

# Grab the text body from the document
pattern = re.compile(f"{re.escape(start)}(.*?)\n{re.escape(end)}", re.DOTALL)
text = pattern.findall(resp.text)
result = "\n".join(text)

# Save the text body to a local file (skipped if the file already exists)
if not os.path.exists("crime_and_punishment.txt"):
    with open("crime_and_punishment.txt", "w", encoding="utf-8") as file:
        file.write(result)
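
To see how the marker-based extraction behaves without downloading the book, here is a minimal sketch run on a small made-up string. The `sample` text and the `extract_body` helper are hypothetical names used only for illustration; only the two marker strings come from the cell above.

```python
import re

START = "*** START OF THE PROJECT GUTENBERG EBOOK CRIME AND PUNISHMENT ***"
END = "*** END OF THE PROJECT GUTENBERG EBOOK CRIME AND PUNISHMENT ***"


def extract_body(raw):
    """Return the text between the START and END markers, or None if absent."""
    pattern = re.compile(f"{re.escape(START)}(.*?){re.escape(END)}", re.DOTALL)
    match = pattern.search(raw)
    if match is None:
        return None
    return match.group(1).strip()


# Made-up stand-in for resp.text: header, markers, body, footer
sample = (
    "Project Gutenberg header and license notes\n"
    + START + "\n"
    + "CRIME AND PUNISHMENT\n\nPART I\n"
    + END + "\n"
    + "Project Gutenberg footer\n"
)

body = extract_body(sample)
print(body)
print(extract_body("no markers here"))
```

Using `search` here makes the missing-marker case explicit: the helper returns `None`, whereas `findall` returns an empty list that joins to an empty string and would be written to the file without any warning.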

## STEP 3: BASIC DATA ORGANIZATION
- Perform steps here to extract important info from the collected data
- We are just using a dictionary here to perform a simple word count

In [None]:
# Function to get the count of all words in a given text body
def count_words(text):
    words = re.findall(r'\b\w+\b', text.lower())
    word_count = {}
    for word in words:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
    return word_count
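
As a possible alternative to the manual dictionary loop, the standard library's `collections.Counter` can do the same tally. This is a sketch rather than the project's method; `count_words_counter` and the `sample` sentence are made-up names for illustration, and the tokenizing regex is copied from `count_words`.

```python
import re
from collections import Counter


def count_words_counter(text):
    """Count lowercase word tokens using the same regex as count_words."""
    words = re.findall(r"\b\w+\b", text.lower())
    return Counter(words)


# Made-up sentence standing in for the book text
sample = "The door. The stairs, the door!"
counts = count_words_counter(sample)

print(counts["the"])          # number of times "the" appears
print(counts.most_common(2))  # the two most frequent words
```

A `Counter` is a `dict` subclass, so it can be passed to the same `pd.DataFrame(list(counts.items()), ...)` call used later in this project.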

## STEP 4: DISPLAY OUTPUTS
- Display collected data points from the data source
- We should now have a usable dataset of values from the data source for further analysis and training
- We have put the data in a labeled pandas DataFrame so the values are easy to sort and index

In [None]:
# Get the word counts, put them in a pandas DataFrame, and sort by count
word_count = count_words(result)
df = pd.DataFrame(list(word_count.items()), columns=['Word', 'Count'])
sorted_df = df.sort_values(by='Count', ascending=False)

# Print the result
print(sorted_df)
print(sorted_df.describe())
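
The sorting and indexing mentioned above can be sketched on a tiny example. The `sample_counts` dictionary is made up and stands in for the output of `count_words`; the column names match the ones used in the cell above.

```python
import pandas as pd

# Made-up counts standing in for the output of count_words
sample_counts = {"the": 3, "door": 2, "stairs": 1}

df = pd.DataFrame(list(sample_counts.items()), columns=["Word", "Count"])
sorted_df = df.sort_values(by="Count", ascending=False)

# Top 2 rows by count
top = sorted_df.head(2)
print(top)

# Look up a single word by using the Word column as the index
by_word = sorted_df.set_index("Word")
print(by_word.loc["door", "Count"])
```

With the full book, `sorted_df.head(10)` would show the ten most frequent words in the same way.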