# Wrangling Text Data in Python

The same material as before, now shown only as code blocks and their output.

## Reading a file as a string

In [2]:
filename = 'example.txt'
with open(filename, 'r') as file_in:
    raw_text = file_in.read()

print(raw_text[:1000])

 1
Class Goal
The goal of this class is to make the skills of data wrangling in Python familiar and easy to use.
2 Class Description
This course covers data selection, cleaning, and manipulation in Python, in- cluding reading and writing data, the Pandas library for cleaning, transforming, merging, reshaping, and data aggregation. This course will make use of Collab for handling assignments and grading. We will also use GitHub for some code management (making an account is advised but not required). This course is worth 1 credit.
3 Readings
There is one book for this course, Python for Data Analysis. It is published by O’Reilly Media. You can read it free online via the UVa library (the link is on the Collab page). Readings will be assigned to supplement the in class portion of the class.
1
4 Assessment
Grades will be assigned based on performance on 4 homework assignments each worth 100 points. The total of all points earned will be compared with the table below to determine the final

## installing nltk

In [None]:
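%%bash
# Install NLTK, then fetch the Punkt models used by sent_tokenize and word_tokenize.
# Newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'.
pip3 install nltk
python3 -c "import nltk; nltk.download('punkt')"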

## reading a file as lines

In [3]:
filename = 'example.txt'
with open(filename, 'r') as file_in:
    raw_text = file_in.readlines()

raw_text[:10]

[' 1\n',
 'Class Goal\n',
 'The goal of this class is to make the skills of data wrangling in Python familiar and easy to use.\n',
 '2 Class Description\n',
 'This course covers data selection, cleaning, and manipulation in Python, in- cluding reading and writing data, the Pandas library for cleaning, transforming, merging, reshaping, and data aggregation. This course will make use of Collab for handling assignments and grading. We will also use GitHub for some code management (making an account is advised but not required). This course is worth 1 credit.\n',
 '3 Readings\n',
 'There is one book for this course, Python for Data Analysis. It is published by O’Reilly Media. You can read it free online via the UVa library (the link is on the Collab page). Readings will be assigned to supplement the in class portion of the class.\n',
 '1\n',
 '4 Assessment\n',
 'Grades will be assigned based on performance on 4 homework assignments each worth 100 points. The total of all points earned will

## problems with reading by "lines"

In [4]:
raw_text[4]

'This course covers data selection, cleaning, and manipulation in Python, in- cluding reading and writing data, the Pandas library for cleaning, transforming, merging, reshaping, and data aggregation. This course will make use of Collab for handling assignments and grading. We will also use GitHub for some code management (making an account is advised but not required). This course is worth 1 credit.\n'
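
The output above is one "line" that holds several sentences: in this file, newlines mark headings and paragraphs, not sentence ends. A minimal sketch of that mismatch, using a two-line excerpt drawn from the output above and held in an in-memory file. The variable names are illustrative, and counting periods is only a rough proxy for counting sentences.

```python
import io

# Two "lines" drawn from the output above: a heading and (part of) a paragraph.
excerpt = (
    "2 Class Description\n"
    "This course will make use of Collab for handling assignments and grading. "
    "We will also use GitHub for some code management "
    "(making an account is advised but not required). "
    "This course is worth 1 credit.\n"
)

# readlines() splits only on newline characters.
lines = io.StringIO(excerpt).readlines()

# Rough proxy: count periods to estimate the number of sentences per line.
periods_per_line = [line.count(".") for line in lines]
print(len(lines), periods_per_line)
```

The heading line contains no sentence-ending period at all, while the paragraph line contains three.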

## segmenting with nltk.sent_tokenize()

In [5]:
import nltk
filename = 'example.txt'
with open(filename, 'r') as file_in:
    raw_text = file_in.read()

sentences = nltk.sent_tokenize(raw_text)
sentences[:10]

[' 1\nClass Goal\nThe goal of this class is to make the skills of data wrangling in Python familiar and easy to use.',
 '2 Class Description\nThis course covers data selection, cleaning, and manipulation in Python, in- cluding reading and writing data, the Pandas library for cleaning, transforming, merging, reshaping, and data aggregation.',
 'This course will make use of Collab for handling assignments and grading.',
 'We will also use GitHub for some code management (making an account is advised but not required).',
 'This course is worth 1 credit.',
 '3 Readings\nThere is one book for this course, Python for Data Analysis.',
 'It is published by O’Reilly Media.',
 'You can read it free online via the UVa library (the link is on the Collab page).',
 'Readings will be assigned to supplement the in class portion of the class.',
 '1\n4 Assessment\nGrades will be assigned based on performance on 4 homework assignments each worth 100 points.']

## can we even tell what a sentence is?

> “Mrs. Rachael, I needn’t inform you who were acquainted with the late Miss Barbary’s affairs, that her means die with her and that this young lady, now her aunt is dead–”

> “My aunt, sir!”

> “It is really of no use carrying on a deception when no object is to be gained by it,” said Mr. Kenge smoothly, “Aunt in fact, though not in law.
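
To see why the passage above is hard, here is a rough sketch of a naive rule applied to part of it. The rule (a sentence ends at `.`, `!` or `?` followed by whitespace) is an assumption made for illustration, not what `nltk.sent_tokenize` does internally.

```python
import re

# Part of the passage quoted above, joined into one string.
passage = (
    "“My aunt, sir!” “It is really of no use carrying on a deception "
    "when no object is to be gained by it,” said Mr. Kenge smoothly."
)

# Naive rule: split after '.', '!' or '?' when whitespace follows.
pieces = re.split(r"(?<=[.!?])\s+", passage)

for piece in pieces:
    print(repr(piece))
```

The rule fails in both directions here: it breaks after the abbreviation `Mr.`, and it misses the change of speaker after `sir!”` because a closing quotation mark sits between the `!` and the space.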

## mixing the approaches

In [6]:
import nltk
filename = 'example.txt'
with open(filename, 'r') as file_in:
    raw_lines = file_in.readlines()

sents = []
for line in raw_lines:
    sents.extend(nltk.sent_tokenize(line))

sents[:10]

[' 1',
 'Class Goal',
 'The goal of this class is to make the skills of data wrangling in Python familiar and easy to use.',
 '2 Class Description',
 'This course covers data selection, cleaning, and manipulation in Python, in- cluding reading and writing data, the Pandas library for cleaning, transforming, merging, reshaping, and data aggregation.',
 'This course will make use of Collab for handling assignments and grading.',
 'We will also use GitHub for some code management (making an account is advised but not required).',
 'This course is worth 1 credit.',
 '3 Readings',
 'There is one book for this course, Python for Data Analysis.']

## core structure - the word

In [7]:
# raw_text is the full file contents as one string (read in In [5])
tokens = nltk.word_tokenize(raw_text)
print(len(tokens))
tokens[:10]

448


['1', 'Class', 'Goal', 'The', 'goal', 'of', 'this', 'class', 'is', 'to']

## difficulties with tokenizers

In [8]:
tokens[40:42]

['in-', 'cluding']
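
The split word comes from a hyphenated line break in the source document. One possible repair, sketched below, is to rejoin such halves before tokenizing. The helper name `join_line_break_hyphens` and the regex are assumptions for illustration. The heuristic is crude: it would also wrongly join a phrase like "pre- and post-", so check its effect on real data.

```python
import re

def join_line_break_hyphens(text):
    """Join word halves split as 'in- cluding': hyphen, whitespace, lowercase letter."""
    return re.sub(r"(\w)-\s+([a-z])", r"\1\2", text)

fragment = "manipulation in Python, in- cluding reading and writing data"
repaired = join_line_break_hyphens(fragment)
print(repaired)
```

Ordinary hyphenated words such as `well-known` are left alone because no whitespace follows the hyphen.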

## token vs type 

In [9]:
tokens[:5]

['1', 'Class', 'Goal', 'The', 'goal']

In [10]:
print(len(tokens))
unique_tokens = set(tokens)
print(len(unique_tokens))

448
222
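
A token is one occurrence of a word and a type is a distinct word, so `len(tokens)` counts tokens and `len(set(tokens))` counts types. A small sketch on one sentence from the example file, with whitespace splitting standing in for a real tokenizer (an assumption to keep the example self-contained):

```python
from collections import Counter

sentence = ("The goal of this class is to make the skills of data wrangling "
            "in Python familiar and easy to use.")

# Whitespace splitting stands in for a real tokenizer here.
sample_tokens = sentence.rstrip(".").split()

# A Counter maps each type to the number of tokens of that type.
type_counts = Counter(sample_tokens)

print(len(sample_tokens))  # number of tokens
print(len(type_counts))    # number of types
print([t for t, n in type_counts.items() if n > 1])  # types that occur more than once
```

Note that `The` and `the` count as two different types at this stage.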


## more preprocessing

In [11]:
print(len(tokens))
print(len(set(tokens)))
tokens = [token.lower() for token in tokens]
print(len(set(tokens)))

448
222
205
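
The type count drops after lowercasing because types that differ only in case collapse into one. A minimal sketch using the first ten tokens shown earlier (the variable names are illustrative):

```python
from collections import Counter

# The first ten tokens from the word_tokenize output shown earlier.
sample_tokens = ['1', 'Class', 'Goal', 'The', 'goal', 'of', 'this', 'class', 'is', 'to']

types_before = set(sample_tokens)
lowered_counts = Counter(token.lower() for token in sample_tokens)

# Types that absorbed more than one token after lowercasing.
merged = sorted(t for t, n in lowered_counts.items() if n > 1)

print(len(types_before), len(lowered_counts), merged)
```

Here `Class`/`class` and `Goal`/`goal` merge, so ten types become eight.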


## natural language processing pipelines: three examples

First example (word2vec - cares about sentence structure and ordering)
* get text
* segment text
* tokenize texts, but preserve the sentences
* lowercase all tokens
* remove punctuation and stopwords
* analyze
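
The steps of this first pipeline can be sketched with the standard library alone. This is a rough illustration under stated assumptions, not a reference implementation: the regex segmenter and tokenizer stand in for `nltk.sent_tokenize` and `nltk.word_tokenize`, and `STOPWORDS` is a tiny hypothetical list.

```python
import re

# Hypothetical, deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "of", "this", "is", "to", "and", "a", "for", "will", "in"}

def preprocess_sentences(text):
    """Return a list of sentences, each a list of lowercase content tokens."""
    # Segment: naive split after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    processed = []
    for sentence in sentences:
        # Tokenize: runs of word characters, or single punctuation marks.
        toks = re.findall(r"\w+|[^\w\s]", sentence)
        # Lowercase all tokens.
        toks = [t.lower() for t in toks]
        # Remove punctuation (tokens with no letter or digit) and stopwords.
        toks = [t for t in toks
                if any(ch.isalnum() for ch in t) and t not in STOPWORDS]
        if toks:
            processed.append(toks)
    return processed

sample = ("This course will make use of Collab for handling assignments and grading. "
          "This course is worth 1 credit.")
result = preprocess_sentences(sample)
for sent in result:
    print(sent)
```

The result stays a list of lists, so sentence boundaries and token order survive for the analysis step.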

Second example (stylometry - cares about punctuation and part of speech use)
* get text
* tokenize text (don't care about sentences)
* lowercase all tokens
* tag all words with parts of speech
* analyze
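
Part of this second pipeline can be sketched without a tagger: tokenize while keeping punctuation, lowercase, and measure how often each punctuation mark occurs. The part-of-speech step is left out because it needs a trained tagger (for example `nltk.pos_tag`). The function name `punctuation_profile` and the regex tokenizer are assumptions for illustration.

```python
import re
from collections import Counter

def punctuation_profile(text):
    """Relative frequency of each punctuation token among all tokens."""
    # Keep punctuation marks as tokens of their own, and lowercase everything.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # A token that does not start with a word character is punctuation.
    punct_counts = Counter(t for t in tokens if not re.match(r"\w", t))
    total = len(tokens)
    return {mark: count / total for mark, count in punct_counts.items()}

sample = ("We will also use GitHub for some code management "
          "(making an account is advised but not required).")
profile = punctuation_profile(sample)
print(profile)
```

Unlike the first pipeline, nothing is thrown away here, since the punctuation itself is the signal.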

Third example (topic modeling with bag of words)
* get text
* divide each text into a hundred pieces
* tokenize text
* throw away the token ordering
* analyze
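
A minimal sketch of this third pipeline, assuming a hypothetical helper `bag_of_words_chunks`. It departs from the listed order in one respect: it tokenizes first and then divides, so that each piece ends on a token boundary. The sample uses four pieces, not a hundred, to keep the output readable.

```python
from collections import Counter

def bag_of_words_chunks(tokens, n_pieces):
    """Split tokens into n_pieces nearly equal chunks and count the words in each."""
    size, extra = divmod(len(tokens), n_pieces)
    bags, start = [], 0
    for i in range(n_pieces):
        # The first `extra` chunks get one additional token.
        end = start + size + (1 if i < extra else 0)
        # A Counter keeps word counts and throws away token ordering.
        bags.append(Counter(tokens[start:end]))
        start = end
    return bags

text = ("the goal of this class is to make the skills of data wrangling "
        "in python familiar and easy to use")
sample_tokens = text.split()  # whitespace splitting stands in for a real tokenizer
bags = bag_of_words_chunks(sample_tokens, n_pieces=4)
for bag in bags:
    print(dict(bag))
```

Each bag can then be treated as one "document" by the analysis step.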