# <center>Session 3 - Reading/Writing Data</center>

<hr>

## Table of Contents

<b>

* <a href="#Reading Text">Reading Text</a>
* <a href="#CSV">CSV</a>
* <a href="#Exercise 1">Exercise 1</a>
* [Finding Multiple Files](#Glob)
* [Exercise 2](#Exercise2)
* <a href="#Writing">Writing</a>
* <a href="#Requests">Requests</a>
* [Exercise 3](#Exercise3)

</b>


<a id="Reading Text"></a>
<hr>
# Reading Text

Python comes with a built-in function 'open()', which takes the path (relative or absolute) to the file you're interested in opening.  

    # Example 
    file = 'data/text_file.txt'
    file_hdl = open(file)
    text = file_hdl.read()
    
    file_hdl.close() # Don't forget to close that file handle!!

In [159]:
file = 'data/text_file.txt'
file_hdl = open(file)
text = file_hdl.read()

file_hdl.close()

print(text)

You’ve successfully read in a text file!


Python has a more concise way to do this that doesn't require you to remember to close the file handle. 

    with open(file) as file_hdl:
        text = file_hdl.read()
        
When the indented block ends, Python closes the file handle automatically. 

In [161]:
with open(file) as file_hdl:
    text_v2 = file_hdl.read()
    
print(text_v2)

You’ve successfully read in a text file!
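
One way to confirm that the `with` block really closes the handle is to check the handle's `closed` attribute afterwards. The sketch below is only an illustration: it writes a small temporary file first so that it runs anywhere, and the file name `demo.txt` and the `encoding="utf-8"` argument are assumptions, not part of the session's data.

```python
import os
import tempfile

# Create a throwaway file so the example does not depend on data/text_file.txt
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

with open(demo_path, "w", encoding="utf-8") as out_hdl:
    out_hdl.write("line one\nline two\n")

# Read it back; the handle is closed as soon as the indented block ends
with open(demo_path, encoding="utf-8") as file_hdl:
    demo_text = file_hdl.read()

print(file_hdl.closed)  # True
print(demo_text)
```

Passing an explicit encoding is optional, but it avoids surprises when a file contains characters such as curly quotes.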


<a id="CSV"></a>
<hr>
# CSV

A CSV (Comma-Separated Values) file is a lot like an Excel worksheet (and can in fact be opened by Excel).  Python has a library for reading and writing CSV files, and it relies on the built-in open function. 

    import csv  # You only need to run this once per project
    
    csv_path = "data/quiz_questions.csv"
    
    with open(csv_path) as csvfile:
        csv_reader = csv.reader(csvfile)
        for row in csv_reader:
            print(row)
    
    

In [167]:
import csv  # You only need to run this once per project
    
csv_path = "data/quiz_questions.csv"
    
with open(csv_path) as csvfile:
    csv_reader = csv.reader(csvfile)
    for row in csv_reader:
        print(row)

['question', 'optiona', 'optionb', 'optionc', 'optiond', 'optiona_response', 'optionb_response', 'optionc_response', 'optiond_response']
['How many sides does a triangle have?', '1', '2', '3', '4', 'What do we call something with one side?', 'Maybe draw something with only two sides first', 'correct', 'draw a four sided shape then try again']


Note that the reader returns the header as the first row. It would be more useful if each row's values were mapped to their column headers.  The csv library has a DictReader that reads each line in as a single dictionary, mapping each column header to the value in that row.  

    with open(csv_path) as csvfile:
        csv_reader = csv.DictReader(csvfile)
        for row in csv_reader:
            print(row)

In [170]:
with open(csv_path) as csvfile:
    csv_reader = csv.DictReader(csvfile)
    for row in csv_reader:
        print(row)

OrderedDict([('question', 'How many sides does a triangle have?'), ('optiona', '1'), ('optionb', '2'), ('optionc', '3'), ('optiond', '4'), ('a_response', 'What do we call something with one side?'), ('b_response', 'Maybe draw something with only two sides first'), ('c_response', 'correct'), ('d_response', 'draw a four sided shape then try again')])


In [171]:
row['question']

'How many sides does a triangle have?'
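
To experiment with DictReader without a file on disk, the following minimal sketch reads from an in-memory string instead. The two-column data and the names `sample_csv` and `rows` are made up for illustration.

```python
import csv
import io

# Hypothetical two-column CSV held in memory instead of on disk
sample_csv = io.StringIO("question,answer\nHow many sides does a square have?,4\n")

reader = csv.DictReader(sample_csv)
rows = list(reader)          # each row is a dictionary keyed by column header

print(reader.fieldnames)     # ['question', 'answer']
print(rows[0]["answer"])     # 4
```

Note that every value comes back as a string, so the `4` above is `'4'` until you convert it with `int()`.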

<a id="Exercise 1"></a>

# Exercise 1

Take the dictionary we just read in using CSV, and ask the user for their answer.  If that answer is not mapped to the value 'correct' in the dictionary, print the hint and ask again. 

Look at the dictionary keys if you're not sure exactly where to begin!

In [None]:
## Your code here

<hr>
<a id="Glob"></a>
# Finding Multiple Files

If you have a folder with a lot of files, and you're looking for a generic pattern or extension, using a library like glob can make your life a lot easier. 

In the 'data/articles' folder there are a number of .txt files, each containing an article pulled from NPR.  There's also a .csv file containing additional information about the articles.  If we want to read ALL of the .txt files, we can get a list of them in one fell swoop. 

In [174]:
import glob

path = "data/articles"
txt_list = glob.glob(path + "/*.txt")

print(len(txt_list))
print(txt_list[0])

48
data/articles/267166222.txt
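
The sketch below shows the same pattern on a throwaway folder, and one possible way to recover the ID part of each filename with `os.path`. The folder contents and the names `demo_dir`, `demo_txt`, and `demo_ids` are hypothetical.

```python
import glob
import os
import tempfile

# Build a throwaway folder with hypothetical file names
demo_dir = tempfile.mkdtemp()
for name in ["101.txt", "102.txt", "log.csv"]:
    with open(os.path.join(demo_dir, name), "w") as hdl:
        hdl.write("placeholder")

# glob returns paths in arbitrary order, so sort for a repeatable result
demo_txt = sorted(glob.glob(os.path.join(demo_dir, "*.txt")))

# Strip the folder and the extension to recover the ID part of each name
demo_ids = [os.path.splitext(os.path.basename(p))[0] for p in demo_txt]
print(demo_ids)  # ['101', '102']
```

Using `os.path.join` and `os.path.basename` rather than string concatenation keeps the code working on systems that use a different path separator.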


<a id="Exercise2"></a>

# Exercise 2

Look at the file 'data/articles/article_log.csv', and get a sense of what it contains.  Note that the ID maps to a filename (i.e. ID 569893288 maps to 569893288.txt).

Write some code that reads in each line of article_log.csv as a dictionary, and add each of those dictionaries to a larger dictionary that maps the ID of the article to the dictionary for that line. See below for the result of reading in a single line. 


    {'569893288': OrderedDict([('ID', '569893288'),
                               ('Date', '2017-12-14'),
                               ('Title', "NPR's Favorite TV Shows Of 2017"),
                               ('Link',
                                'https://www.npr.org/sections/monkeysee/2017/12/14/569893288/nprs-favorite-tv-shows-of-2017')])}

In [181]:
data_dict = {}

with open('data/articles/article_log.csv') as file_hdl:
    reader = csv.DictReader(file_hdl)
    
    for line in reader:
        data_dict[line["ID"]] = line
        
print(len(data_dict))


48


### Exercise 2.1

Loop over all the article txt files you found.  For each file:

1. Get the ID from the filename
2. Read in the content
3. Get the following and add them to the dictionary for each article:
    * The number of characters
    * The number of sentences
    * The number of paragraphs
    
Potentially useful functions:

If you have a string in a variable, you can replace text in it using .replace(), and split it into a list of pieces using .split(). For example:

    test = "textfile.txt"
    test = test.replace(".txt", "")
    # test now equals 'textfile'
    
    test2 = "Sentence one. Sentence 2"
    split_test2 = test2.split(".")
    # split_test2 is now a list containing the strings "Sentence one" and " Sentence 2"
    
In a string variable, a newline is represented by the '\n' character.  In many news articles, a paragraph can be identified by two newline characters in a row. 
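
As a rough illustration of these counts, the sketch below works on a made-up string rather than a real article. Treating every period as the end of a sentence is a simplifying assumption that will miscount abbreviations and decimals.

```python
sample = "First sentence. Second sentence.\n\nA new paragraph. It has two sentences."

paragraphs = sample.split("\n\n")
# split(".") leaves an empty string after the final period, so drop blank pieces
sentences = [s.strip() for s in sample.split(".") if s.strip()]

print(len(sample))      # number of characters, including the newlines
print(len(paragraphs))  # 2
print(len(sentences))   # 4
```

Without the filter, `len(sample.split("."))` would report 5 here because of the trailing empty piece.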

In [194]:
for file in txt_list:
    file_id = file.replace("data/articles/", "").replace(".txt","")
    
    with open(file) as file_hdl:
        text = file_hdl.read()
        
        length = len(text)
        sentences = len(text.split("."))
        paragraphs = len(text.split("\n\n"))
        
        data_dict[file_id]["Length"] = length
        data_dict[file_id]["Sentences"] = sentences 
        data_dict[file_id]["Paragraphs"] = paragraphs
        
        

<a id="Writing"></a>

<hr>
# Writing to File

The open function can take more than just a filename. The optional second argument is the mode. It defaults to reading, but there are a number of other options: you can tell it to append to a file, write to a file, write to a file ONLY if the file doesn't already exist, and more. 

In the case of writing to a file, we pass the 'w' argument.  

To write the CSV file with the adjusted dictionaries, we'll use the DictWriter from the csv library, and a 'w' in our open call.

In [195]:
file_out = "data/articles/article_log_improved.csv"
with open(file_out, 'w', newline='') as file_out_hdl:
    # newline='' lets the csv module control line endings (avoids blank rows on Windows).
    # The writer needs to know WHERE to write and what the column headers are.
    # To get the headers, take ANY single-line dictionary and use its keys.
    fieldnames = list(next(iter(data_dict.values())).keys())
    writer = csv.DictWriter(file_out_hdl, fieldnames)
    
    writer.writeheader()

    for value in data_dict.values():
        writer.writerow(value)
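
The different modes mentioned above can be compared side by side. This is a minimal sketch on a temporary file; the file name `modes.txt` and the variable names are hypothetical.

```python
import os
import tempfile

demo_file = os.path.join(tempfile.mkdtemp(), "modes.txt")  # hypothetical file

with open(demo_file, "w") as hdl:   # 'w' creates the file or overwrites it
    hdl.write("first\n")

with open(demo_file, "a") as hdl:   # 'a' appends to the end
    hdl.write("second\n")

try:
    with open(demo_file, "x") as hdl:   # 'x' refuses to touch an existing file
        hdl.write("never written\n")
except FileExistsError:
    exclusive_failed = True
else:
    exclusive_failed = False

with open(demo_file) as hdl:        # default mode is 'r' (read)
    contents = hdl.read()

print(exclusive_failed)  # True
print(contents)
```

Because 'w' silently overwrites, 'x' is the safer choice when losing an existing file would be a problem.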

<hr>
<a id="Requests"></a>

# Requests

One final way to get data is the requests library.  We can use it to get all the content of a URL, whether that be HTML, JSON, or some other internet format. 

In [199]:
import requests

path = 'http://text.npr.org/s.php?sId=572945894'
r = requests.get(path)

In [200]:
type(r)

requests.models.Response

In [201]:
r

<Response [200]>

In [219]:
r.text

'    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n<html>\n<head><title>Text-Only NPR.org : How An American Became Santa In A Little Town In France</title><meta name="robots" content="noindex, nofollow" />\n<meta name="viewport" content="width=device-width, initial-scale=1" /></head>\n<body>\n<script type="text/javascript">\nif (window != top)\n  {\n      top.location.href = location.href;\n  }\n</script>\n<p>Text-Only NPR.org (go to <a href="https://www.npr.org">full version</a>)</p>\n<p><a href="/">Home</a> &gt; <a href="/p.php?pid=3"> Program: Morning Edition</a></p>\n<p>How An American Became Santa In A Little Town In France</p>\n<p>By Eleanor Beardsley</p>\n<p>Morning Edition,  &middot; Aurelie Garat still can\'t get over how she found her Pere Noel this year. She\'s the Christmas pageant organizer for the tiny Normandy town of Vimoutiers. </p>\n<p>"I was parking when I saw a nice young man with a beautiful beard sitting in his car on the phone," says Garat. "So

<a id="Exercise3"></a>

# Exercise 3

Use string functions to print ONLY the text between the paragraph tags in the html. 

To find the starting index of a character sequence in a string, use the .find() function. 

    test_str = "The quick brown fox jumped over the lazy dog"
    fox_idx = test_str.find("fox") #now fox_idx will equal 16
    
    print(test_str[fox_idx:]) ## Will print 'fox jumped over the lazy dog'

In [223]:
test_str = "The quick brown fox jumped over the lazy dog"
fox_idx = test_str.find("fox")  # now fox_idx will equal 16

print(test_str[fox_idx:])  # Will print 'fox jumped over the lazy dog'


fox jumped over the lazy dog
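
The .find() function also accepts an optional second argument, the index at which to start searching, and it returns -1 when the sequence is not found. The sketch below uses a tiny made-up HTML string (not the NPR page) to show both behaviours.

```python
html = "<p>one</p><div>skip</div><p>two</p>"

first_open = html.find("<p>")                    # 0
first_close = html.find("</p>", first_open)      # search starts at first_open
first_text = html[first_open + len("<p>"):first_close]

# Continue the search after the first closing tag
second_open = html.find("<p>", first_close)
second_close = html.find("</p>", second_open)
second_text = html[second_open + len("<p>"):second_close]

print(first_text, second_text)   # one two
print(html.find("<table>"))      # -1, meaning "not found"
```

Searching from a start index avoids having to slice off the part of the string that has already been processed.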


In [226]:
text = r.text

text_only = ""
while True:
    start = text.find("<p>")
    end = text.find("</p>", start)  # look for the closing tag after the opening one
    if start == -1 or end == -1:
        break

    p_text = text[start+3:end]
    text_only += p_text + "\n\n"
    text = text[end+4:]

print(text_only)
    

Text-Only NPR.org (go to <a href="https://www.npr.org">full version</a>)

<a href="/">Home</a> &gt; <a href="/p.php?pid=3"> Program: Morning Edition</a>

How An American Became Santa In A Little Town In France

By Eleanor Beardsley

Morning Edition,  &middot; Aurelie Garat still can't get over how she found her Pere Noel this year. She's the Christmas pageant organizer for the tiny Normandy town of Vimoutiers. 

"I was parking when I saw a nice young man with a beautiful beard sitting in his car on the phone," says Garat. "So I said to him, your beard speaks to me. Would you be our Father Christmas this year?"

Garat says they had been looking everywhere and had almost given up hope. She says it was as if this Pere Noel — as the French call Santa Claus — had just fallen out of the sky.

That heaven-sent Santa is 66-year-old retired American photographer Tom Haley, who happens to be fixing up a house he recently bought near Vimoutiers. Haley says he wasn't actually so surprised by Garat