# **Unit 1 Assignment: Topics 1-3**

## *DATA 5420/6420*
## Name: Patrick Neyland

For this first assignment I want you to spend some time thinking about a dataset you might want to work with throughout the semester to build something cool! This could be a personal project, something you build for someone else, or maybe even the start of a business -- get creative!

Once you have selected your dataset or data source, I want you to apply the skills you've learned from Unit 1 (topics 1-3), including finding and loading a data source, describing and understanding your text data, and then cleaning/preprocessing your data to prep it for feature engineering!

We of course aren't ready to begin critically analyzing your text yet, but use this as an opportunity to explore different sources of text, consider what you'd be interested in building something around, and start jogging ideas of how you'd like to see this project progress throughout the semester.

**If you are in the 5420 (undergraduate) section of this course**, you may choose a dataset that has been precompiled from a source like [Kaggle](https://www.kaggle.com/datasets?tags=13204-NLP), though you are not constrained to this source!

**If you are in the 6420 (graduate) section of this course**, I'd like you to source your text data from an API or scraping, or if you have data from work/another personal project, that's fine too.

**If you use ChatGPT or other LLMs (which I highly encourage) -- share your prompts in the template -- I'd love to see your approach!**


> *Throughout this assignment I will ask you to provide comments in your code that indicate what each step/line is doing. This is a great habit to get into to not only make sure you know what's going on as you're learning new code, but to also ensure that anyone else who might access your code in the future can read it and understand your process! You will be docked points if you fail to include this documentation!*

## **Import Dependencies \& Packages**

In [9]:
# import required libraries and packages
import PyPDF2
import nltk
import re

# Download NLTK resources (tokenizers, stopwords, etc.)
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Patri\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Patri\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [24]:
import nltk, re, pprint

from urllib import request
from bs4 import BeautifulSoup  # needed for parsing HTML

import contractions  # contractions dictionary
from string import punctuation

# import spacy  # used for lemmatization
# In a terminal (not in Python): python -m spacy download en_core_web_sm

from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords  # stopword lists
tokenizer = ToktokTokenizer()  # word tokenizer
from nltk import word_tokenize

import pandas as pd
import numpy as np

## **Part 1: Selecting \& Importing a Data Source**

Remember, if you are in the DATA 6420 section, your data needs to be sourced from either: an API or web-scraping (or if you have data from work/personal project already, that's fine too).

In [32]:
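# import data
def extract_text_from_pdf(pdf_file):
    """Return the text of every page in a PDF as one string."""
    raw = ""
    # open the PDF in binary mode for PyPDF2
    with open(pdf_file, "rb") as file:
        pdf_reader = PyPDF2.PdfReader(file)
        # loop over the pages and append each page's extracted text
        for page in pdf_reader.pages:
            # guard against pages with no extractable text
            raw += page.extract_text() or ""
    return raw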

In [33]:
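# path to the article PDF (raw string so the backslash is not read as an escape)
pdf_file_path = r"data\text.pdf"
# extract the full text of the article
text = extract_text_from_pdf(pdf_file_path)

# inspect the raw extracted text
print(text)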

THE ACCOUNTING REVIEW American Accounting Association
Vol. 97, No. 1 DOI: 10.2308/TAR-2018-0694
January 2022pp. 29–49
In Defense of Limited Manufacturing Cost Control:
Disciplining Acquisition of Private Information by Suppliers
Anil Arya
Brian Mittendorf
The Ohio State University
Ram N. V. Ramanan
Binghamton University, SUNY
ABSTRACT: When a firm’s input supplier can acquire and misreport private information to gain an edge in
negotiations, we show that the firm can blunt the supplier’s informational advantage by permitting inefficiencies in itsown internal production. Specifically, we establish that a modest increase in the cost of the input(s) a firm makesinternally credibly commits it to be more aggressive in negotiations with a supplier for the input(s) the firm buys.
Recognizing that its potential information rents will be limited, the supplier, in turn, becomes less aggressive in
information acquisition. The paper fully characterizes the equilibrium—the firm’s investments, the s

*(Add in any LLM prompts/relevant bits of conversations along the way -- share your approach!)*
#### GPT Prompt
How do I import text data from .pdf files into Python to do some NLP?

**What was your motivation for choosing this data set/source?**

I chose this dataset because it is an academic article that could be used for one of my final project ideas. 

**I imagine you're still work-shopping a plan, but what are some interesting things you hope to do with this data going forward in the class?**

I outline this above in my ideas list. 

**Provide or create a description card for this dataset that includes things like:**

* Source of the data : The Accounting Review
* Date of data collection : 9/14/2023
* Genre : Accounting
* Intended use cases : reviewing accounting research
* Potential sources of bias : 
* Anything interesting about the structure or format of the text : An academic paper found in the journal 'The Accounting Review'

## **Part 2: Cleaning \& Preprocessing the Dataset**

As applicable apply any necessary cleaning to your dataset in this next step, thinking things like special character removal, HTML parsing, etc.

In [34]:
# locate the start of the article body and the start of the reference list
print("[", text.find("INTRODUCTION"), ":", text.rfind("REFERENCES"), "]")

[ 1274 : 69544 ]


In [35]:
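# keep only the article body: first "INTRODUCTION" up to the last "REFERENCES"
text = text[text.find("INTRODUCTION") : text.rfind("REFERENCES")]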

In [36]:
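nltk.download('stopwords')  # make sure the stopword corpus is available
tokenizer = ToktokTokenizer()  # word tokenizer
stopword_list = set(stopwords.words('english'))  # English stopwords as a set for fast lookup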

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Patri\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [38]:
nltk.download('punkt')  # tokenizer models used by word_tokenize

def basic_text_cleaner(text):
    # Remove characters that are not letters, digits, whitespace, or periods
    # (A-Za-z, not A-z: the range A-z also matches [ \ ] ^ _ and `)
    text = re.sub(r'[^A-Za-z0-9\s.]', '', text)
    # Tokenize into words
    tokens = word_tokenize(text)
    # Optional: casefold and remove stopwords (disabled for now)
    #tokens = [token.lower() for token in tokens if token.lower() not in stopword_list]

    # Join tokens and trim extra whitespace
    cleaned_text = ' '.join(tokens).strip()

    return cleaned_text

# clean the article body and display the result
cleaned_text = basic_text_cleaner(text)
cleaned_text

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Patri\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


'INTRODUCTION Make and buy decisions are familiar issues for accountants . Often the problem is cast as a choice between either making or buying an input . While the make versus buy decision is indeed reective of some circumstances manufacturing rms often opt to both make and buy inputs . That is rms purchase some inputs from external suppliers while relying on inhouse production for others they nd areas of specialization and outsource those for which theyhave no discernable expertise . In this paper we examine such sourcing circumstances and demonstrate that informationalreasons can result in the costs of made inputs and the cost of outsourced inputs to interact . In particular higher internal makecosts credibly commit a rm to be tough in procurement negotiations with its external supplier . This in turn undercuts the suppliers incentive to acquire and misreport private information that would have given it an edge in negotiations . Broadly stated internal production inefciencies come 
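The cleaned output contains words such as "reective", "rms", and "inefciencies", where "fl" and "fi" have gone missing. One likely cause, which is an assumption rather than something the notebook confirms, is that the PDF stores those letter pairs as single ligature characters (for example U+FB01 and U+FB02), and the special-character regex deletes them. A minimal sketch of a possible fix is to normalize the text with NFKC before stripping characters. The function names and the sample string below are hypothetical.

```python
import re
import unicodedata

def normalize_ligatures(text):
    """Expand compatibility characters such as the 'fi' ligature (U+FB01)."""
    # NFKC maps ligature code points to their plain-letter equivalents
    return unicodedata.normalize("NFKC", text)

def strip_special_chars(text):
    """Drop everything except letters, digits, whitespace, and periods."""
    return re.sub(r"[^A-Za-z0-9\s.]", "", text)

# Hypothetical sample containing the ligatures U+FB01 (fi) and U+FB02 (fl)
sample = "manufacturing \ufb01rms are re\ufb02ective of inef\ufb01ciencies"

print(strip_special_chars(sample))                       # ligatures are deleted
print(strip_special_chars(normalize_ligatures(sample)))  # ligatures are expanded first
```

If this assumption holds for the article, calling `normalize_ligatures(text)` before `basic_text_cleaner(text)` should keep words like "firms" intact.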

**Describe any cleaning steps you applied to your data**

**Now let's consider what preprocessing steps may or may not be necessary for your given dataset -- maybe even thinking ahead to what you plan to do with your data later on...**

**Which of the following preprocessing steps will you apply to your data (at least for now)?**



*   Casefolding
*   Contraction Expansion
*   Stopword removal
*   Lemmatization
*   Stemming
*   Other?

**Explain which you choose to apply and why, as well as which you are choosing to not apply and why.**

In [None]:
# apply your preprocessing to create a 'cleaned' version of your text
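
One possible shape for this step, as a minimal sketch rather than a definitive pipeline: casefold, expand contractions, strip punctuation, then remove stopwords. The order matters because punctuation removal deletes the apostrophes that contraction expansion relies on. The `STOPWORDS` set and `CONTRACTIONS` map below are small hypothetical stand-ins so the sketch runs without downloads. In the notebook they would more naturally come from `stopwords.words('english')` and the `contractions` package.

```python
import re

# Hypothetical stand-ins for a real stopword list and contraction dictionary
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "it", "not", "do", "for"}
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}

def preprocess(text):
    # 1. Casefold so "Firms" and "firms" count as the same token
    text = text.lower()
    # 2. Expand contractions before punctuation removal deletes the apostrophes
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # 3. Remove everything except letters, digits, and whitespace
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # 4. Split on whitespace and drop stopwords
    return [tok for tok in text.split() if tok not in STOPWORDS]

print(preprocess("It's costly: firms don't buy the inputs."))
```

Lemmatization and stemming are left out here. Whether they help depends on what the tokens are used for later.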

## **Part 3: A bit of Text Exploration**

Spend some time exploring your data by looking at text statistics and text visualization (frequency distribution plot or word cloud, etc.)

In [None]:
# text statistics -- e.g. total number of words, total unique words, lexical diversity, top most frequent words, etc.
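
A minimal sketch of the statistics named in the comment above, using only the standard library. The helper name `text_statistics` and the sample tokens are hypothetical. In the notebook the input would be the token list produced by preprocessing.

```python
from collections import Counter

def text_statistics(tokens, top_n=5):
    """Summarize a token list with a few basic counts."""
    counts = Counter(tokens)
    total = len(tokens)
    unique = len(counts)
    return {
        "total_words": total,
        "unique_words": unique,
        # lexical diversity = unique / total (guard against empty input)
        "lexical_diversity": unique / total if total else 0.0,
        "top_words": counts.most_common(top_n),
    }

# Hypothetical sample tokens for illustration
sample_tokens = "firms make inputs and firms buy inputs from suppliers".split()
stats = text_statistics(sample_tokens, top_n=2)
print(stats)
```
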

In [None]:
# an interesting or useful text visualization
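
As a dependency-free stand-in for a frequency distribution plot or word cloud, the sketch below prints a text bar chart of the most frequent tokens. The helper name `frequency_bars` and the sample tokens are hypothetical. A plotting library would give a nicer figure, but the counting logic would be the same.

```python
from collections import Counter

def frequency_bars(tokens, top_n=5, width=20):
    """Return text bars for the top_n tokens, scaled to the most frequent one."""
    top = Counter(tokens).most_common(top_n)
    if not top:
        return []
    max_count = top[0][1]
    lines = []
    for word, count in top:
        # scale each bar relative to the most frequent token, at least one mark
        bar = "#" * max(1, round(width * count / max_count))
        lines.append(f"{word:<12} {bar} {count}")
    return lines

# Hypothetical sample tokens for illustration
sample_tokens = "cost supplier cost firm cost supplier input".split()
for line in frequency_bars(sample_tokens, top_n=3, width=12):
    print(line)
```
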

**What insights have you gained through exploring your text with some basic text statistics and visualizations?**

**Are there any changes you'd make to the way you preprocessed your text based on your findings? E.g. a custom stop-word list**

**Now that you've chosen and explored a data set/source, what will your next step be to progress this project? (Not for you to do now, but just thinking ahead).**

## **For next class...Be prepared to:**

**Provide a brief (sub 2 minute) stand up to the class to share:**


*   **What you're working on (your dataset source)**
*   **Any road blocks you've run into**
*   **What you plan/hope to do next in your project**

