# Lab 2 - Data Types, String/File Processing, Encodings

In this second lab we will be looking at some of the more important skills needed to do text analysis. In order to start doing text analysis you first need to have text to analyze. This text will be stored in various data types. In the last lab you already saw two data types: int (for storing whole numbers) and strings. In this lab we will look at other important data types needed to do text processing.

Next we will look at string and file processing. We will learn how to read data to and from text files, how to store the data in data types such as strings and lists. And then we will do string processing on this data that is now stored in a string or list.

Lastly one of the most important things is file encodings. Computers store data as 0s and 1s so in order to read a text file we need to convert those 0s and 1s to letters. This is done using an encoding scheme. The problem is that there are lots of encoding schemes and different ones are used all the time. Some of the more common ones are utf-8 and utf-16 as well as latin1. So whenever you open a text file you need to make sure you are using the correct encoding scheme.

### Data Types

Data types are one of the basic building blocks of your code. As mentioned earlier, data types are used to store information which we can then work with in various ways. In python the data type is automatically determined when you first assign a value to a variable. So when we assigned the number '18' to the variable 'age' in the last lab, the variable 'age' was turned into an integer data type.

#### Storing Numbers

For storing numbers there are two data types: the `int` data type for whole numbers and the `float` data type for decimal numbers. The `int` data type will store things like word counts and sentence counts and the `float` data type will store things like average word length.

Work through the following code block:

In [1]:
# The following three variables are of type int since they contain whole numbers
age1 = 18
age2 = 22
age3 = 19

# to find the data type you can use 'type()'
print(type(age1))

# then we compute the average and print the result and the type of averageAge
averageAge = (age1 + age2 + age3)/3
print(averageAge)
print(type(averageAge))

<class 'int'>
19.666666666666668
<class 'float'>


#### Storing Text

We already encountered the `string` data type for saving text data. Another important data type is `list`. The `list` data type holds multiple items in a list format and each item can be accessed by its index. The index for the first item in the list starts at 0 and increments by one.

While we will often use the list data type for storing words or text data it can be used for anything since it is just a way to store multiple objects in a list format.

Work through the next code block:

In [2]:
# to assign a list variable put all items in quotes, separated by commas, and enclosed in square brackets:
sentence1 = ["Daisy", "picks", "some", "flowers", "."]

# we can print the contents of this variable like any other variable by using print()
print(sentence1)

# we can also print each item individually by putting the index of the item we want in square brackets:
print(sentence1[0])

['Daisy', 'picks', 'some', 'flowers', '.']
Daisy


In [3]:
# now using string concatenation print the sentence "Daisy picks flowers."
# remember to add the spaces between the words
print((sentence1 [0], sentence1 [1], sentence1[3]))

('Daisy', 'picks', 'flowers')


The last data type we will look at is the dictionary. A dictionary is very similar to a list but instead of accessing it using an index starting at 0 we assign each value a special index. This special index is called the key. 

There are various ways to create a dict data type. We will use this type: 
`dictName = {'key1':'value1', 'key2':'value2'}`
To read about the other ways of creating a dict, and for a good overview of some of the functions we can use with the dict data type view this [dict tutorial](https://realpython.com/python-dicts/)

One use may be for storing token counts of files. I will demonstrate this use in the following code block.

In [4]:
# Dictionary for storing token counts per file
tokenCount = {'file1':13200, 'file2':12093, 'file':29093}

# to call up the token count of a certain file pass the key as an argument to the dict
print(tokenCount['file1'])

13200


### File Processing

To understand file processing and encodings read the file manipulation tutorial which will give an overview of all of the techniques and functions you will be using: [file manipulation](https://realpython.com/read-write-files-python/)

Opening a file using the `with open() as` will store the opened file in a data type that we can then use to read the file contents and store the contents to a variable. I will demonstrate how to open a file and save the contents to a string variable. The `open()` function takes two inputs: first the name of the file, and second whether you want to read("r") or write("w") to the file.

First let's create our own text file in a new folder. Inside the folder that contains this notebook create a new folder, let's call it "dataset0", in which we will create our text file. Now using notepad create a file with some text in it and save it inside the dataset0 folder. Now read through the code block to see how you can open and read this file.

To learn more about filepaths read the following chapter up to and including the "Absolute vs. Relative Paths": https://automatetheboringstuff.com/chapter8/

### Importing Libraries

Python only comes with a very basic set of functions that we can use such as `print()` this means that if we want to do more interesting things, like get word count or sentence count, we need to import libraries that let us do so. A library is like a collection of functions that we can use in our programs. To import a library we just put the `import` command followed by the name of the library before we use functions from that library.

Two libraries, or modules, that we will be using in this lab are `os` and `nltk`. The `os` module will be used for moving around the file system and listing files in a given directory. The `nltk` module contains many usefull functions for text processing and we will use it throughout the semester. It's part of a great [introduction to NLP using python](https://www.nltk.org/book/). 

First we will look at NLTK. In addition to the functions in the NLTK library we will be using a dataset that is provided through the NLTK module. After running the code block below you should see a window opening somewhere if you are on windows. In this window select the "book" dataset and download it. If you are on Mac or Linux you should see a download bar pop up below the code block. Enter 'd' for download and press enter followed by the identifier "book" to download the book dataset.

In [2]:
# create a variable to hold the path to the file
filePath = "./dataset0/Filename.txt"

# open the file as "r" or read only and store this opened file in f
with open(filePath, "r") as f:
    # read the data from f and store it in the string variable "data"
    data = f.read()
    
# we can now print the data
print(data)

Hello World! 


In [1]:
pwd

'/Users/shubhkarman/Desktop/SDA250/lab2/Untitled'

In [3]:
# first we import the NLTK module
import nltk

# after importing we will download the dataset
#nltk.download()

Now that the data set has been downloaded we can import it using `from nltk.book import *`. After importing we will look at some functions introduced in chapter 1 of the [nltk textbook](http://www.nltk.org/book/ch01.html).

In [4]:
# import the data set
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [5]:
# we can check what each text is by simply entering the text1-text9
print(text1)
print(text4)

<Text: Moby Dick by Herman Melville 1851>
<Text: Inaugural Address Corpus>


In [6]:
# one basic function is len() which returns the length of the text
print(len(text1))

# another is count() which will count the number of occurences of a word
print(text1.count("ship"))

# we can do some basic calculations such as what percentage of the text is made up by a specific word
percentage = 100 * text1.count("ship")/len(text1)
print(percentage)

260819
507
0.19438767881174301


Like mentioned earlier, the `os` library is used for interacting with the filesystem. Some usefull functions are `getcwd()` which prints your current directory, `listdir()` which lists all files in your current directory and `rename()` which lets you rename a file in the directory.

In [7]:
# first lets import the os module
import os

# next we can print our current directory
# notice how we need to specify the module first followed by the function
print(os.getcwd())

# next we list the contents of our current directory
print(os.listdir())

/Users/shubhkarman/Desktop/SDA250/lab2/Untitled
['saveFile.txt', 'dataset0', 'Lab2best.ipynb', 'dataset1', 'Lab2.ipynb', '.github', '.ipynb_checkpoints', '.git']


When we printed the contents of our current folder you saw that there is a subfolder called 'dataset1'. In the next code block you will print the contents of that subfolder so that you know how many files you have for the last assignment in this lab.

In order to print the contents of a subfolder you will need to provide the path to that folder as an argument to the `listdir()` function. However the path to our current folder can be quiet lengthy but there is a way to shorten this. In python when you deal with paths there are two "shortcuts". A single period `.` means "starting from the current directory". So if we want to list the contents of the subfolder "dataset1" we would simply provide `./dataset1/`. If we wanted to go up one directory from our current directory you would use `..` which means "go up one directory".

In [9]:
# This will list the contents of the directory above where we currently are
print(os.listdir("../"))

# now print the contents of the "dataset1" directory
print(os.listdir("./dataset1"))

['.DS_Store', '.ipynb_checkpoints', 'Untitled']
['.gitkeep', 'adventuresOfHuckleberryFinn.txt', 'aTaleOfTwoCities.txt', 'prideAndPrejudice.txt']


Now, to put everything together, write a small script that gets the data from the dataset included with this lab. Since there are 3 files included in the dataset1 folder you will need to `open()` and `read()` 3 times. Then pick 3 words and find what percentage of the text they make up together. Your output should look something like this:
```
text1:
word1 percentage = x%
word2 percentage = y%
word3 percentage = z%
total percentage = xyz%
text2:
```

Make sure you format the output so that it is easy to read.

In [10]:
# create a variable to hold the path to the file
filePath = "./dataset1/adventuresOfHuckleberryFinn.txt"
# open the file as "r" or read only and store this opened file in f
with open(filePath, "r", encoding = "utf8") as f:
    # read the data from f and store it in the string variable "ahf"
    ahf = f.read()
    
# we can now print the data
print(ahf)

﻿The Project Gutenberg eBook of Adventures of Huckleberry Finn, Complete, by Mark Twain (Samuel Clemens)

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.

Title: Adventures of Huckleberry Finn, Complete

Author: Mark Twain (Samuel Clemens)

Release Date: August, 1993 [eBook #76]
[Most recently updated: January 21, 2021]

Language: English

Character set encoding: UTF-8

Produced by: David Widger

*** START OF THE PROJECT GUTENBERG EBOOK HUCKLEBERRY FINN ***




ADVENTURES

OF

HUCKLEBERRY FINN

(Tom Sawyer's Comrade)

By Mark Twain

Complete




CONTENTS.

CHAPTER I. Civilizing Huck.--Miss Watson.-

In [11]:
# one basic function is len() which returns the length of the text
print(len(ahf))

# another is count() which will count the number of occurences of a word
print(ahf.count("States"))

# we can do some basic calculations such as what percentage of the text is made up by a specific word
percentageStates = 100 * ahf.count("States")/len(ahf)
print(percentageStates)

594362
18
0.003028457404746602


In [12]:
# one basic function is len() which returns the length of the text
print(len(ahf))

# another is count() which will count the number of occurences of a word
print(ahf.count("Singing"))

# we can do some basic calculations such as what percentage of the text is made up by a specific word
percentageSinging = 100 * ahf.count("Singing")/len(ahf)
print(percentageSinging)

594362
1
0.00016824763359703346


In [13]:
# one basic function is len() which returns the length of the text
print(len(ahf))

# another is count() which will count the number of occurences of a word
print(ahf.count("his"))

# we can do some basic calculations such as what percentage of the text is made up by a specific word
percentageHis = 100 * ahf.count("his")/len(ahf)
print(percentageHis)

594362
897
0.150918127336539


In [33]:
# one basic function is len() which returns the length of the text
print(len(ahf))



# we can do some basic calculations such as what percentage of the text is made up by a specific word
percentageTotal = 100 * (percentageStates)/(percentageSinging)/(percentageHis)
print(percentageTotal)

594362
11926.996655518395


In [17]:
# create a variable to hold the path to the file
filePath = "./dataset1/aTaleOfTwoCities.txt"
# open the file as "r" or read only and store this opened file in f
with open(filePath, "r", encoding = "utf8") as f:
    # read the data from f and store it in the string variable "ttc"
    ttc = f.read()
    
# we can now print the data
print(ttc)

﻿The Project Gutenberg EBook of A Tale of Two Cities, by Charles Dickens

This eBook is for the use of anyone anywhere in the United States and most
other parts of the world at no cost and with almost no restrictions
whatsoever.  You may copy it, give it away or re-use it under the terms of
the Project Gutenberg License included with this eBook or online at
www.gutenberg.org.  If you are not located in the United States, you'll have
to check the laws of the country where you are located before using this ebook.

Title: A Tale of Two Cities
       A Story of the French Revolution
       
Author: Charles Dickens

Release Date: January, 1994 [EBook #98]
[Most recently updated: December 20, 2020]

Language: English

Character set encoding: UTF-8

*** START OF THIS PROJECT GUTENBERG EBOOK A TALE OF TWO CITIES ***




Produced by Judith Boss, and David Widger




A TALE OF TWO CITIES

A STORY OF THE FRENCH REVOLUTION

By Charles Dickens


CONTENTS


     Book the First--Recalled to Life

   

In [19]:
# one basic function is len() which returns the length of the text
print(len(ttc))

# another is count() which will count the number of occurences of a word
print(ttc.count("Before"))

# we can do some basic calculations such as what percentage of the text is made up by a specific word
percentageBefore = 100 * ttc.count("Before")/len(ttc)
print(percentageBefore)

777416
10
0.0012863126048344774


In [20]:
# one basic function is len() which returns the length of the text
print(len(ttc))

# another is count() which will count the number of occurences of a word
print(ttc.count("Two"))

# we can do some basic calculations such as what percentage of the text is made up by a specific word
percentageTwo = 100 * ttc.count("Two")/len(ttc)
print(percentageTwo)

777416
33
0.004244831595953775


In [21]:
# one basic function is len() which returns the length of the text
print(len(ttc))

# another is count() which will count the number of occurences of a word
print(ttc.count("City"))

# we can do some basic calculations such as what percentage of the text is made up by a specific word
percentageCity = 100 * ttc.count("City")/len(ttc)
print(percentageCity)

777416
4
0.0005145250419337909


In [32]:
# one basic function is len() which returns the length of the text
print(len(ttc))



# we can do some basic calculations such as what percentage of the text is made up by a specific word
percentageTotal = 100 * (percentageBefore)/(percentageTwo)/(percentageCity)
print(percentageTotal)

777416
58895.15151515153


In [23]:
# create a variable to hold the path to the file
filePath = "./dataset1/prideAndPrejudice.txt"
# open the file as "r" or read only and store this opened file in f
with open(filePath, "r", encoding = "utf8") as f:
    # read the data from f and store it in the string variable "pap"
    pap = f.read()
    
# we can now print the data
print(pap)

﻿
The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Pride and Prejudice

Author: Jane Austen

Release Date: August 26, 2008 [EBook #1342]
Last Updated: November 12, 2019


Language: English

Character set encoding: UTF-8

*** START OF THIS PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***




Produced by Anonymous Volunteers, and David Widger

THERE IS AN ILLUSTRATED EDITION OF THIS TITLE WHICH MAY VIEWED AT EBOOK
[# 42671 ]

cover




      Pride and Prejudice

      By Jane Austen

        CONTENTS

         Chapter 1

         Chapter 2

         Chapter 3

         Chapter 4

         Chapter 5

         Chapter 6

         Chapter 7

         Chapter 8

         Chapter 9

         Chapter 10

         Chapter 1

In [24]:
# one basic function is len() which returns the length of the text
print(len(pap))

# another is count() which will count the number of occurences of a word
print(pap.count("Long"))

# we can do some basic calculations such as what percentage of the text is made up by a specific word
percentageLong = 100 * ttc.count("Long")/len(pap)
print(percentageLong)

775741
103
0.0016758170574972832


In [25]:
# one basic function is len() which returns the length of the text
print(len(pap))

# another is count() which will count the number of occurences of a word
print(pap.count("Should"))

# we can do some basic calculations such as what percentage of the text is made up by a specific word
percentageShould = 100 * ttc.count("Should")/len(pap)
print(percentageShould)

775741
0
0.00012890900442286794


In [28]:
# one basic function is len() which returns the length of the text
print(len(pap))

# another is count() which will count the number of occurences of a word
print(pap.count("And"))

# we can do some basic calculations such as what percentage of the text is made up by a specific word
percentageAnd = 100 * ttc.count("And")/len(pap)
print(percentageAnd)

775741
154
0.01869180564131585


In [31]:
# one basic function is len() which returns the length of the text
print(len(pap))


# we can do some basic calculations such as what percentage of the text is made up by a specific word
percentageTotal = 100 * (percentageLong)/(percentageShould)/(percentageAnd)
print(percentageTotal)

775741
69549.19310344828


#### Storing to a file

We learned how to open and read from a file but often we want to store our data to a file as well so that we can use it in the future. Writing to a file is very similar to reading from a file. The following code block will demonstrate how to save to a .txt file. In order to store data to a file you will first need to create the file somewhere. If you create a file called "saveFile.txt" in the same folder as the notebook then the following code block will work without any changes. If you create the file somewhere else you will need to update the path to the file in the `open()` function.

In [34]:
# open a file like we did before
with open("saveFile.txt", 'w') as out:
    # now instead of calling read() we call write() to save to a file.
    out.write("This line will be saved to the file.\n")

As you can see saving to a file is just as easy as reading from a file and will be usefull for storing results and other data that you obtain from your programs. Saving data will become increasingly more important throughout this course as we will be dealing with more and more data. To finish the lab we will look at one other important file type and how to read and write to it. This file type is the "csv" format. This format is used for data that is stored in a table format like an excel spreadsheet. We will first create and write to a csv file and then we will open and read the data that we wrote.

First create an empty .csv file somewhere. Then in the code block we will open the csv file just as we did with the .txt files. The only difference is in how we actually write to the file. To write to a csv file we will create a "file writer" that we will call to write to the file.

In [43]:
# finish the code to open the csv file that you created
import csv 
with open('csv', 'w') as out:
    # now we create our file writer object
    # the delimiter is what is used to separate our data into columns
    # Here we use a comma because it is a csv file, that is a file with comma-separated values
    line_writer = csv.writer(out, delimiter=',')
    
    # now that we have a writer object we can use it to write our rows
    # below I demonstrate how to use the writer to create our headings
    line_writer.writerow(['fileName', 'word', 'wordPercentage'])
    line_writer.writerow(['adventuresOfHuckleberryFinn', 'States', percentageStates])
    line_writer.writerow(['adventuresOfHuckleberryFinn', 'Singing', percentageSinging])
    line_writer.writerow(['adventuresOfHuckleberryFinn', 'his', percentageHis])
    line_writer.writerow(['aTaleOfTwoCities', 'Before', percentageBefore])
    line_writer.writerow(['aTaleOfTwoCities', 'Two', percentageTwo])
    line_writer.writerow(['aTaleOfTwoCities', 'City', percentageCity])
    line_writer.writerow(['prideAndPrejudice', 'Long', percentageLong])
    line_writer.writerow(['prideAndPrejudice', 'Should', percentageShould])
    line_writer.writerow(['prideAndPrejudice', 'And', percentageAnd])
    # finish the code to write the information from the word percentages to the csv file

Now that we have saved our data to a csv file we want to retrieve it again so that we can do further analysis on it. Reading from a csv file is very similar to reading from a text file so just work through the following code block to learn how to do so and finish the lab.

In [44]:
# first open the file as a read only object
with open('csv', 'r') as f:
    # here we create a reader object, similar to our writer object
    csv_reader = csv.reader(f, delimiter=',')
    
    # we will now use something called a for loop to loop through each row of data
    # we will look more at for loops in the next lab
    for row in csv_reader:
        # we print each row of data
        print(row)

['fileName', 'word', 'wordPercentage']
['adventuresOfHuckleberryFinn', 'States', '0.003028457404746602']
['adventuresOfHuckleberryFinn', 'Singing', '0.00016824763359703346']
['adventuresOfHuckleberryFinn', 'his', '0.150918127336539']
['aTaleOfTwoCities', 'Before', '0.0012863126048344774']
['aTaleOfTwoCities', 'Two', '0.004244831595953775']
['aTaleOfTwoCities', 'City', '0.0005145250419337909']
['prideAndPrejudice', 'Long', '0.0016758170574972832']
['prideAndPrejudice', 'Should', '0.00012890900442286794']
['prideAndPrejudice', 'And', '0.01869180564131585']
