# Lab 2 - Data Types, String/File Processing, Encodings

In this second lab we will look at some of the core skills needed for text analysis. Before you can analyze text, you first need to have text to analyze, and that text will be stored in various data types. In the last lab you already saw two data types: `int` (for storing whole numbers) and strings. In this lab we will look at other important data types needed for text processing.

Next we will look at string and file processing. We will learn how to read data from and write data to text files, and how to store that data in data types such as strings and lists. Then we will do string processing on the data now stored in a string or list.

Lastly, one of the most important topics is file encodings. Computers store data as 0s and 1s, so in order to read a text file we need to convert those 0s and 1s to letters. This is done using an encoding scheme. The problem is that there are many encoding schemes and different files use different ones. Some of the more common ones are utf-8, utf-16 and latin1. So whenever you open a text file you need to make sure you are using the correct encoding scheme.

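As a small illustration of what an encoding scheme does, the sketch below encodes one short string with the three schemes named above and prints the resulting bytes. The sample word is made up for this example.

```python
# A made-up sample word containing one non-ASCII character
word = "café"

# encode() turns text into bytes using the named encoding scheme
utf8_bytes = word.encode("utf-8")
latin1_bytes = word.encode("latin1")
utf16_bytes = word.encode("utf-16")

print(utf8_bytes)        # 'é' takes two bytes in utf-8
print(latin1_bytes)      # 'é' takes one byte in latin1
print(len(utf16_bytes))  # utf-16 uses at least two bytes per character

# decoding with the matching scheme gives back the original text
print(utf8_bytes.decode("utf-8"))

# decoding with a different scheme raises no error here, but the text is wrong
print(utf8_bytes.decode("latin1"))
```

The last line prints garbled characters in place of the `é`, which is the typical symptom of opening a file with the wrong encoding.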
### Data Types

Data types are one of the basic building blocks of your code. As mentioned earlier, data types are used to store information which we can then work with in various ways. In Python the data type is determined automatically from the value you assign to a variable. So when we assigned the number `18` to the variable `age` in the last lab, `age` held a value of the integer data type.

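To illustrate, here is a minimal sketch using the `age` variable from the last lab. The second assignment is a hypothetical addition that shows the type following the value.

```python
age = 18
print(type(age))   # <class 'int'>

# assigning a different kind of value changes the type the variable reports
age = "eighteen"
print(type(age))   # <class 'str'>
```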
#### Storing Numbers

For storing numbers there are two data types: the `int` data type for whole numbers and the `float` data type for decimal numbers. The `int` data type will store things like word counts and sentence counts and the `float` data type will store things like average word length.

Work through the following code block:

In [5]:
# The following three variables are of type int since they contain whole numbers
age1 = 18
age2 = 22
age3 = 19

# to find the data type you can use 'type()'
print(type(age1))

# complete the computation to find the average age:
# hint: add the ages and divide by 3
averageAge = (age1 + age2 + age3) / 3

# finally print the value of averageAge and its data type:
print(type(averageAge))
print(averageAge)

<class 'int'>
<class 'float'>
19.666666666666668


#### Storing Text

We already encountered the string data type (`str` in Python) for saving text data. Another important data type is `list`. A `list` holds multiple items in order, and each item can be accessed by its index. The index of the first item is 0 and each following item's index is one higher.

While we will often use the list data type for storing words or text data it can be used for anything since it is just a way to store multiple objects in a list format.

Work through the next code block:

In [6]:
# to assign a list of strings put each item in quotes, separate the items with commas, and enclose them in square brackets:
sentence1 = ["Daisy", "picks", "some", "flowers", "."]

# we can print the contents of this variable like any other variable by using print()
print(sentence1)

# we can also print each item individually by putting the index of the item we want in square brackets:
print(sentence1[0])

['Daisy', 'picks', 'some', 'flowers', '.']
Daisy


In [7]:
# now using string concatenation print the sentence "Daisy picks flowers."
# remember to add the spaces between the words

sentence2 = "Daisy" + " " + "picks" + " " + "flowers" + "."
print(sentence2)

Daisy picks flowers.

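The concatenation above can also be written with the string method `join()`, and `split()` goes the other way, turning a string into a list. This is a small sketch rather than part of the lab's required solution; the variable names are made up for illustration.

```python
words = ["Daisy", "picks", "flowers"]

# join() glues the list items together, placing a space between each pair
sentence = " ".join(words) + "."
print(sentence)

# split() breaks a string into a list wherever there is whitespace
tokens = sentence.split()
print(tokens)

# len() works on lists too, and negative indexes count from the end
print(len(tokens))
print(tokens[-1])
```

Note that `split()` leaves the period attached to the last word, which is one reason later labs use a proper tokenizer.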

The last data type we will look at is the dictionary (`dict`). A dictionary is similar to a list, but instead of accessing items by an index starting at 0, we give each value its own label. This label is called the key.

There are various ways to create a dict. We will use this form:
`dictName = {'key1':'value1', 'key2':'value2'}`
To read about the other ways of creating a dict, and for a good overview of some of the functions we can use with the dict data type, see this [dict tutorial](https://realpython.com/python-dicts/).

One use may be for storing token counts of files. I will demonstrate this use in the following code block.

In [2]:
# Dictionary for storing token counts per file
tokenCount = {'file1':13200, 'file2':12093, 'file3':29093}

# to look up the token count of a certain file, put its key in square brackets after the dict name
print(tokenCount['file3'])

29093

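Beyond looking up a single key, a few other dictionary operations come up often when counting things. The sketch below uses made-up file names and counts, not results from the lab's dataset.

```python
# hypothetical token counts, for illustration only
counts = {"fileA": 100, "fileB": 250}

# add a new key or overwrite an existing one with an assignment
counts["fileC"] = 75
counts["fileA"] = 120

# get() returns a default instead of raising KeyError for a missing key
print(counts.get("fileZ", 0))

# items() lets us loop over every key together with its value
for name, count in counts.items():
    print(name, count)

# sum() over the values gives the total number of tokens
total = sum(counts.values())
print(total)
```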

### File Processing

To understand file processing and encodings, read the file manipulation tutorial, which gives an overview of the techniques and functions you will be using: [file manipulation](https://realpython.com/read-write-files-python/).

Opening a file with `with open(...) as f` gives us a file object, which we can use to read the file's contents and store them in a variable. The file is closed automatically when the `with` block ends. I will demonstrate how to open a file and save the contents to a string variable.

In [5]:
# First, create a text file somewhere on your computer. Write/paste a couple of sentences and save the file
# Set the path to the file. Change this to the directory on your computer where you saved the file
from pathlib import Path

folder = Path("/Users/mathiasgausachs/SDA 250/")

filePath = folder / "lab2text.rtf"

# open the file as "r" or read only and store this opened file in f
with open(filePath, "r") as f:
    # read the data from f and store it in the string variable "data"
    data = f.read()
    
# we can now print the data
print(data)

{\rtf1\ansi\ansicpg1252\cocoartf2577
\cocoatextscaling0\cocoaplatform0{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww11520\viewh8400\viewkind0
\pard\tx566\tx1133\tx1700\tx2267\tx2834\tx3401\tx3968\tx4535\tx5102\tx5669\tx6236\tx6803\pardirnatural\partightenfactor0

\f0\fs24 \cf0 Hello world, this is a text file.}

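The output above contains RTF formatting codes because the file was saved as rich text (`.rtf`); saving it as plain text (`.txt`) would leave only the sentence itself. Separately, `open()` accepts an `encoding` argument, which ties back to the encoding schemes discussed in the introduction. The sketch below is self-contained: it writes a made-up sample file to a temporary folder and reads it back with a matching and a non-matching encoding.

```python
import tempfile
from pathlib import Path

# write a small sample file into a temporary folder so the example is self-contained
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"

    # the encoding argument controls how text is turned into bytes on disk
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("naïve café\n")

    # reading with the same encoding gives back the original text
    with open(sample_path, "r", encoding="utf-8") as f:
        right = f.read()

    # reading with a different encoding does not fail here, but the text is garbled
    with open(sample_path, "r", encoding="latin1") as f:
        wrong = f.read()

print(right)
print(wrong)
```

If `encoding` is left out, Python falls back to a default that depends on your system, so stating it explicitly makes a script behave the same on different computers.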

### Importing Libraries

Python comes with a basic set of built-in functions, such as `print()`. This means that if we want to do more interesting things, like get a word count or sentence count, we need to import libraries that let us do so. A library is a collection of functions that we can use in our programs. To import a library, we write the `import` keyword followed by the name of the library, before we use any functions from that library.

Two libraries, or modules, that we will be using in this lab are `os` and `nltk`. The `os` module will be used for moving around the file system and listing files in a given directory. The `nltk` module contains many useful functions for text processing and we will use it throughout the semester.

First we will look at NLTK. Run the cell below and enter "d" for download and then "book" to download the book dataset from nltk. When it has finished downloading, enter "q" to exit the download screen. For a list of all the datasets included in nltk you can enter "l".

In [6]:
# first we import the NLTK module
import nltk

# after importing we will download the dataset
nltk.download()

True

Now that the data set has been downloaded we can import it using `from nltk.book import *`. After importing we will look at some functions introduced in chapter 1 of the [nltk textbook](http://www.nltk.org/book/ch01.html).

In [13]:
# import the data set
from nltk.book import *

# I usually try to avoid 'import *'

In [18]:
# we can check what each text is by printing any of the variables text1 to text9
print(text1)
print(text2)
print(text4)

<Text: Moby Dick by Herman Melville 1851>
<Text: Sense and Sensibility by Jane Austen 1811>
<Text: Inaugural Address Corpus>


In [23]:
# one basic function is len() which returns the length of the text
print(len(text1))

# another is count() which will count the number of occurrences of a word
print(text1.count("ship"))

# we can do some basic calculations such as what percentage of the text is made up by a specific word
word = input('what word do you want: ')
percentage = 100 * text1.count(word)/len(text1)
print(percentage)

260819
507
what word do you want: captain
0.04294165685782094


As mentioned earlier, the `os` library is used for interacting with the file system. Some useful functions are `getcwd()`, which returns your current directory, `listdir()`, which lists all files and folders in a directory (the current one by default), and `rename()`, which lets you rename a file.

In [24]:
# first let's import the os module
import os

# next we can print our current directory
# notice how we need to specify the module first followed by the function
print(os.getcwd())

# next we list the contents of our current directory
print(os.listdir())

/Users/mathiasgausachs/Documents/GitHub/lab2_new-mathias-gc
['dataset1', '.ipynb_checkpoints', 'Lab2-draft.ipynb', '.git']


When we printed the contents of our current folder you saw that there is a subfolder called 'dataset1'. In the next code block you will print the contents of that subfolder so that you know how many files you have for the last assignment in this lab.

In order to print the contents of a subfolder you will need to provide the path to that folder as an argument to the `listdir()` function. The full path to our current folder can be quite lengthy, but there is a way to shorten it. When you deal with paths there are two "shortcuts". A single period `.` means "the current directory", so to list the contents of the subfolder "dataset1" we would simply provide `./dataset1/`. Two periods `..` mean "go up one directory" from the current directory.

In [28]:
# This will list the contents of the directory above where we currently are
print(os.listdir("../"))

# now print the contents of the "dataset1" directory
print(os.listdir("./dataset1"))

['.DS_Store', 'lab2_new-mathias-gc', 'lab-1-mathias-gc', 'desktop-tutorial']
['adventuresOfHuckleberryFinn.txt', 'aTaleOfTwoCities.txt', 'prideAndPrejudice.txt']

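To see what `.` and `..` stand for, the sketch below expands them to full paths with `os.path.abspath()`. The printed paths depend on where the notebook is running, so no output is shown here.

```python
import os

# "." expands to the current working directory
here = os.path.abspath(".")
print(here)

# ".." expands to the directory that contains the current one
parent = os.path.abspath("..")
print(parent)

# os.path.join() builds a path to a subfolder without typing separators by hand
dataset_path = os.path.join(".", "dataset1")
print(dataset_path)
```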

Now, to put everything together, write a small script that reads the data from the dataset included with this lab. Since there are 3 files in the dataset1 folder you will need to `open()` and `read()` 3 times. Then, for each text, pick 3 words and find what percentage of the text they make up together. Your output should look something like this:
```
text1:
word1 percentage = x%
word2 percentage = y%
word3 percentage = z%
total percentage = xyz%
text2:
```

Make sure you format the output so that it is easy to read.

In [118]:
from pathlib import Path
datasetFolder = Path("./dataset1")

# Text 1
print('Text 1')
text_1 = datasetFolder / "adventuresOfHuckleberryFinn.txt"
with open(text_1, "r") as file_1:
    data_1 = file_1.read()

words_text1 = {'escape', 'river', 'hat'}
total_1 = 0
pcts_1 = []
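for word in words_text1:
    # note: str.count() also counts matches inside longer words (e.g. 'hat' in 'that'),
    # and len(data_1) is the number of characters, not the number of words
    percentage = 100 * data_1.count(word)/len(data_1)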
    total_1 = total_1 + percentage
    pcts_1.append(percentage)
    print(f'{word} percentage = {percentage}')

print(f'Total Percentage = {total_1}')
    
# Text 2
print('\nText 2')
text_2 = datasetFolder / "aTaleOfTwoCities.txt"
with open(text_2, "r") as file_2:
    data_2 = file_2.read()

words_text2 = {'fire', 'city', 'night'}
total_2 = 0
pcts_2 = []
for word in words_text2:
    percentage = 100 * data_2.count(word)/len(data_2)
    total_2 = total_2 + percentage
    pcts_2.append(percentage)
    print(f'{word} percentage = {percentage}')
    
print(f'Total Percentage = {total_2}')

# Text 3
print('\nText 3')
text_3 = datasetFolder / "prideAndPrejudice.txt"
with open(text_3, "r") as file_3:
    data_3 = file_3.read()

words_text3 = {'dear', 'spoke', 'walk'}
total_3 = 0
pcts_3 = []
for word in words_text3:
    percentage = 100 * data_3.count(word)/len(data_3)
    total_3 = total_3 + percentage
    pcts_3.append(percentage)
    print(f'{word} percentage = {percentage}')
    
print(f'Total Percentage = {total_3}')

Text 1
escape percentage = 0.0006729905343881338
river percentage = 0.024395906871569853
hat percentage = 0.3001537783371077
Total Percentage = 0.3252226757430657

Text 2
fire percentage = 0.009004188233841342
city percentage = 0.006560194284655835
night percentage = 0.03215781512086193
Total Percentage = 0.04772219763935911

Text 3
spoke percentage = 0.0077345402653720766
walk percentage = 0.016500352566127096
dear percentage = 0.021527803738618945
Total Percentage = 0.04576269657011812

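Note that `data.count(word)` counts every place the letters appear, including inside longer words (for example `hat` inside `that`), and `len(data)` counts characters rather than words. If the goal is the share of words, one option is to split the text into tokens first. The sketch below is one way to do that under simple assumptions: the helper name `word_percentage`, the regular expression, and the sample sentence are all made up for illustration, and a tokenizer such as the ones in `nltk` would handle punctuation more carefully.

```python
import re

def word_percentage(text, word):
    """Return the percentage of tokens in text that equal word (case-insensitive)."""
    # a simple tokenizer: runs of letters and apostrophes, lowercased
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 100 * tokens.count(word.lower()) / len(tokens)

# made-up sample text, not taken from the dataset
sample = "That hat is what he wore. The hat sat on the mat."

# substring counting also matches 'hat' inside 'That' and 'what'
print(sample.count("hat"))

# token counting only matches the whole word 'hat'
print(word_percentage(sample, "hat"))

# round() keeps the printed value readable
print(f"hat percentage = {round(word_percentage(sample, 'hat'), 2)}%")
```

In the sample, substring counting finds 4 matches while only 2 of the 12 tokens are actually the word `hat`.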

#### Storing to a file

We learned how to open and read from a file but often we want to store our data to a file as well so that we can use it in the future. Writing to a file is very similar to reading from a file. The following code block will demonstrate how to save to a .txt file.

In [74]:
from pathlib import Path

folder = Path("/Users/mathiasgausachs/SDA 250/")
filePath = folder / 'textFile.txt'

# open a file like we did before
with open(filePath, 'w') as out:
    # now instead of calling read() we call write() to save to a file.
    out.write("This line will be saved to the file.\n")

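One detail about the `'w'` mode used above: it replaces whatever the file already contained, while `'a'` adds to the end. The sketch below demonstrates this on a temporary file so that nothing on your computer is overwritten; the file name is made up.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    notes_path = Path(tmp) / "notes.txt"

    # 'w' creates the file, or empties it first if it already exists
    with open(notes_path, "w") as out:
        out.write("first line\n")

    # 'a' keeps the existing contents and appends to the end
    with open(notes_path, "a") as out:
        out.write("second line\n")

    with open(notes_path, "r") as f:
        after_append = f.read()

    # opening with 'w' again discards both earlier lines
    with open(notes_path, "w") as out:
        out.write("only line\n")

    with open(notes_path, "r") as f:
        after_overwrite = f.read()

print(after_append)
print(after_overwrite)
```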
As you can see, saving to a file is just as easy as reading from a file, and it will be useful for storing results and other data that you obtain from your programs. Saving data will become increasingly important throughout this course as we deal with more and more data.

To finish the lab we will look at one other important file type and how to read from and write to it: the "csv" format. This format is used for data that is stored in a table format, like an Excel spreadsheet. We will first create and write to a csv file and then we will open and read the data that we wrote.

First create a new subfolder in your file system. Then in the code block we will open the csv file just as we did with the .txt files. The only difference is in how we actually write to the file. To write to a csv file we will create a "file writer" that we will call to write to the file.

In [125]:
from pathlib import Path
import csv

folder = Path("/Users/mathiasgausachs/SDA 250/Lab 2")
sprsPath = folder / 'sprs.csv'

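# finish the code to open the csv file inside the subdirectory that you created
# newline='' is recommended by the csv module so that it controls line endings itself
with open(sprsPath, 'w', newline='') as out:
    line_writer = csv.writer(out, delimiter=',')
    
    line_writer.writerow(['fileName', 'word', 'wordPercentage'])
    # zip() pairs each word with its percentage; the words are stored in sets, so this
    # relies on the sets not having changed since the percentages were computed
    for (word, pct) in zip(words_text1, pcts_1):
        line_writer.writerow(['adventuresOfHuckleberryFinn', word, pct])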
    for (word, pct) in zip(words_text2, pcts_2):
        line_writer.writerow(['aTaleOfTwoCities', word, pct])
    for (word, pct) in zip(words_text3, pcts_3):
        line_writer.writerow(['prideAndPrejudice', word, pct])

Now that we have saved our data to a csv file we want to retrieve it again so that we can do further analysis on it. Reading from a csv file is very similar to reading from a text file so just work through the following code block to learn how to do so and finish the lab.

In [126]:
# first open the file as a read only object (newline='' is recommended when reading csv files too)
with open(sprsPath, 'r', newline='') as f:
    # here we create a reader object, similar to our writer object
    csv_reader = csv.reader(f, delimiter=',')
    
    # we will now use something called a for loop to loop through each row of data
    # we will look more at for loops in the next lab
    for row in csv_reader:
        # we print each row of data
        print(row)

['fileName', 'word', 'wordPercentage']
['adventuresOfHuckleberryFinn', 'escape', '0.0006729905343881338']
['adventuresOfHuckleberryFinn', 'river', '0.024395906871569853']
['adventuresOfHuckleberryFinn', 'hat', '0.3001537783371077']
['aTaleOfTwoCities', 'fire', '0.009004188233841342']
['aTaleOfTwoCities', 'city', '0.006560194284655835']
['aTaleOfTwoCities', 'night', '0.03215781512086193']
['prideAndPrejudice', 'spoke', '0.0077345402653720766']
['prideAndPrejudice', 'walk', '0.016500352566127096']
['prideAndPrejudice', 'dear', '0.021527803738618945']

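Notice that the percentages come back as strings (they are printed in quotes), because a csv file stores everything as text. A minimal sketch of converting them back to numbers is shown below. It uses `csv.DictReader`, which reads each row into a dictionary keyed by the header row, and an in-memory sample built with `io.StringIO` so that it runs without the file; the sample values are made up.

```python
import csv
import io

# made-up csv contents with the same header as the file written above
sample_csv = "fileName,word,wordPercentage\nbookA,river,0.5\nbookA,hat,0.25\n"

# StringIO lets csv read from a string as if it were an open file
reader = csv.DictReader(io.StringIO(sample_csv))

total = 0.0
for row in reader:
    # every value arrives as a string, so convert before doing arithmetic
    pct = float(row["wordPercentage"])
    total = total + pct
    print(row["fileName"], row["word"], pct)

print(total)
```

With a real file, the `io.StringIO(...)` call would be replaced by the file object from `open()`.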