# <center>Session 3 - Reading/Writing Data</center>

<hr>

## Table of Contents

<b>

* <a href="#Reading Text">Reading Text</a>
* <a href="#CSV">CSV</a>
* <a href="#Exercise 1">Exercise 1</a>
* [Finding Multiple Files](#Glob)
* [Exercise 2](#Exercise2)
* <a href="#Writing">Writing</a>
* <a href="#Requests">Requests</a>
* [Exercise 3](#Exercise3)

</b>


<a id="Reading Text"></a>
<hr>

# Reading Text

Python comes with a built-in function, 'open()', which takes the path (relative or absolute) to the file you want to open and returns a file handle you can read from.

    # Example 
    file = 'data/text_file.txt'
    file_hdl = open(file)
    text = file_hdl.read()
    
    file_hdl.close() # Don't forget to close that file handle!!

In [None]:
file = 'data/text_file.txt'
file_hdl = open(file)
text = file_hdl.read()

file_hdl.close()

print(text)

Python has a more concise way to do this that doesn't require the developer to remember to close the file handle.

    with open(file) as file_hdl:
        text = file_hdl.read()
        
When the indented block ends, Python closes the file handle automatically.

In [None]:
with open(file) as file_hdl:
    text_v2 = file_hdl.read()
    
print(text_v2)
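
To check that the handle really is closed once the block ends, you can look at the file object's `closed` attribute. The sketch below is self-contained: instead of relying on 'data/text_file.txt', it writes a small throwaway file to a temporary folder (the name `demo.txt` and its content are made up for illustration).

```python
import os
import tempfile

# Create a throwaway file so the example runs anywhere (hypothetical name and content)
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.txt")
with open(demo_path, "w") as out_hdl:
    out_hdl.write("first line\nsecond line\n")

with open(demo_path) as file_hdl:
    inside_closed = file_hdl.closed   # False: we are still inside the block
    demo_text = file_hdl.read()

after_closed = file_hdl.closed        # True: the block has ended

print(inside_closed, after_closed)
print(demo_text)
```

The variable `file_hdl` still exists after the block, but trying to read from it at that point would raise an error because the file is closed.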

<a id="CSV"></a>
<hr>

# CSV

A CSV (Comma-Separated Values) file is a lot like an Excel workbook (and in fact can be opened in Excel). Python has a library, csv, for reading and writing CSV files, and it relies on the built-in open function.

    import csv  # You only need to run this once per project

    csv_path = "data/quiz_questions.csv"

    with open(csv_path) as csvfile:
        csv_reader = csv.reader(csvfile)
        for row in csv_reader:
            print(row)

In [None]:
import csv  # You only need to run this once per project

csv_path = "data/quiz_questions.csv"

with open(csv_path) as csvfile:
    csv_reader = csv.reader(csvfile)
    for row in csv_reader:
        print(row)

Note that the header comes back as the first row. It would be far more useful if each row's values were mapped to the column headers. The csv library has a dictionary reader, DictReader, that does exactly this: it reads each line in as a single dictionary, mapping each column header to the value in that row.

    with open(csv_path) as csvfile:
        csv_reader = csv.DictReader(csvfile)
        for row in csv_reader:
            print(row)

In [None]:
with open(csv_path) as csvfile:
    csv_reader = csv.DictReader(csvfile)
    for row in csv_reader:
        print(row)

In [None]:
row['question']
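
One thing to watch for: once the loop has finished, `row` only holds the last line of the file. If the rows are needed later, they can be collected into a list. Here is a minimal sketch using an in-memory CSV with made-up content (the column names `question`, `correct`, and `hint` are assumptions about what the quiz file looks like, and `io.StringIO` stands in for a real file).

```python
import csv
import io

# Hypothetical stand-in for data/quiz_questions.csv
sample_csv = io.StringIO(
    "question,correct,hint\n"
    "What is 2 + 2?,4,It is an even number\n"
    "What color is the sky?,blue,Look up\n"
)

rows = []
csv_reader = csv.DictReader(sample_csv)
for row in csv_reader:
    rows.append(row)   # keep every row, not just the last one

print(len(rows))
print(rows[0]["question"])
print(row["question"])   # 'row' still refers to the final line only
```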

<a id="Exercise 1"></a>

# Exercise 1

Using the dictionary we just read in with the csv library, show the user the question and ask for their answer. If that answer does not match the value mapped to 'correct' in the dictionary, print the hint and ask again.

Look at the dictionary keys if you're not sure exactly where to begin!

In [1]:
## Your code here

<hr>
<a id="Glob"></a>

# Finding Multiple Files

If you have a folder with a lot of files, and you're looking for a generic pattern or extension, using a library like glob can make your life a lot easier.

In the 'data/articles' folder there are a number of .txt files, each containing an article pulled from NPR. There's also a .csv file containing additional information about the articles. If we want to read ALL of the .txt files, we can get a list of them in one fell swoop.

In [None]:
import glob

path = "data/articles"
txt_list = glob.glob(path + "/*.txt")

print(len(txt_list))
print(txt_list[0])
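
To see what the pattern does and does not match without depending on the 'data/articles' folder, here is a small sketch that builds a temporary folder first. The file names are made up for illustration. Since glob returns its matches in arbitrary order, the result is sorted, and `os.path.basename` is used to strip the folder part from each path.

```python
import glob
import os
import tempfile

# Build a throwaway folder with a few files (hypothetical names)
demo_dir = tempfile.mkdtemp()
for name in ["101.txt", "102.txt", "log.csv"]:
    with open(os.path.join(demo_dir, name), "w") as hdl:
        hdl.write("placeholder")

# '*' matches any run of characters, so this picks up only the .txt files
demo_txt_list = sorted(glob.glob(os.path.join(demo_dir, "*.txt")))

# os.path.basename strips the folder part of each path
demo_names = [os.path.basename(p) for p in demo_txt_list]
print(demo_names)
```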

<a id="Exercise2"></a>

# Exercise 2

Look at the file 'data/articles/article_log.csv', and get a sense of what it contains. Note that the ID maps to a filename (i.e., ID 569893288 maps to 569893288.txt).

Write some code that reads in each line of article_log.csv as a dictionary, then adds each of those dictionaries to a larger dictionary that maps the ID of the article to the dictionary for that line. See below for the result of reading in a single line.


    {'569893288': {'ID': '569893288',
                   'Date': '2017-12-14',
                   'Title': "NPR's Favorite TV Shows Of 2017",
                   'Link': 'https://www.npr.org/sections/monkeysee/2017/12/14/569893288/nprs-favorite-tv-shows-of-2017'}}

Make sure to give the larger dictionary the variable name 'data_dict', as we'll use it later.

In [None]:
## Your code here

### Exercise 2.1

Loop over all the article txt files you found.  For each file:

1. Get the ID from the filename
2. Read in the content
3. Get the following and add them to the dictionary for each article:
    * The number of characters
    * The number of sentences
    * The number of paragraphs
    
Potentially useful functions:

If you have a string as a variable, you can replace text in it using .replace(), and break it into a list of pieces using .split(). For example:

    test = "textfile.txt"
    test = test.replace(".txt", "")
    # test now equals 'textfile'

    test2 = "Sentence one. Sentence 2"
    split_test2 = test2.split(".")
    # split_test2 is now a list containing the strings "Sentence one" and " Sentence 2"
    
In a string variable, a newline is represented by the '\n' character. In many news articles, a paragraph break can be identified by two newline characters in a row.
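
As a small illustration of the paragraph rule, splitting on two newlines in a row breaks a string into paragraphs. The text below is made up for the example, and real articles may use different spacing, so treat this as a sketch rather than a rule that always holds.

```python
# Hypothetical article text with two paragraphs
sample_text = "First paragraph. It has two sentences.\n\nSecond paragraph."

paragraphs = sample_text.split("\n\n")
print(len(paragraphs))
print(paragraphs[1])
print(len(sample_text))   # len() counts every character, including the newlines
```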

In [None]:
## Your code here

<a id="Writing"></a>

<hr>

# Writing to File

The open function can take more than just a filename. The optional second argument is the mode. It defaults to 'r' (read), but there are a number of other options: you can tell it to append to a file ('a'), write to a file ('w'), write to a file ONLY if the file doesn't already exist ('x'), and more.

In the case of writing to a file, we pass the 'w' argument.  
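
The modes mentioned above can be tried out on a throwaway file. This is a minimal sketch: the file name `modes_demo.txt` is made up, and the file lives in a temporary folder so nothing in 'data' is touched.

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")  # hypothetical file name

with open(demo_path, "w") as hdl:   # 'w' creates the file, or empties it if it exists
    hdl.write("one\n")

with open(demo_path, "a") as hdl:   # 'a' appends to the end
    hdl.write("two\n")

try:
    with open(demo_path, "x") as hdl:   # 'x' refuses to overwrite an existing file
        hdl.write("three\n")
    x_failed = False
except FileExistsError:
    x_failed = True

with open(demo_path) as hdl:        # the default mode is 'r' (read)
    contents = hdl.read()

print(contents)
print(x_failed)
```

Be careful with 'w': it silently replaces whatever was in the file before.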

To write the CSV file with the adjusted dictionaries, we'll use the DictWriter from the csv library, and a 'w' in our open call.

In [None]:
file_out = "data/articles/article_log_improved.csv"
# newline='' is what the csv documentation recommends, so rows aren't double-spaced on Windows
with open(file_out, 'w', newline='') as file_out_hdl:
    # This part is a bit confusing.
    # We need to tell the writer WHERE to write, but also what the headers are.
    # To get the headers, we take ANY single-line dictionary and get its keys.
    fieldnames = list(next(iter(data_dict.values())).keys())
    writer = csv.DictWriter(file_out_hdl, fieldnames)

    writer.writeheader()

    for key, value in data_dict.items():
        writer.writerow(value)
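
The cell above depends on the `data_dict` built in Exercise 2. To see the same idea in isolation, here is a self-contained sketch with a made-up dictionary (`demo_dict`, its IDs, titles, and the `characters` column are all hypothetical). It writes to an in-memory buffer instead of a file on disk so the result can be printed straight away.

```python
import csv
import io

# Hypothetical stand-in for data_dict: article ID -> dictionary for that line
demo_dict = {
    "101": {"ID": "101", "Title": "First article", "characters": 1200},
    "102": {"ID": "102", "Title": "Second article", "characters": 950},
}

# Column headers come from the keys of any one of the inner dictionaries
fieldnames = list(next(iter(demo_dict.values())).keys())

buffer = io.StringIO()   # in-memory file, used here instead of a file on disk
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(demo_dict.values())   # write every inner dictionary as one row

print(buffer.getvalue())
```

This only works cleanly if every inner dictionary has the same keys. By default, DictWriter raises an error when a row contains a key that is not in the headers.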

<hr>
<a id="Requests"></a>

# Requests

One final way to get data is the requests library. We can use it to get all the content of a URL, whether that be HTML, JSON, or some other internet format.

In [15]:
import requests

path = 'http://text.npr.org/s.php?sId=572945894'
r = requests.get(path)

In [3]:
type(r)

requests.models.Response

In [14]:
r.url

'http://text.npr.org/s.php?sId=572945894'

In [None]:
r.text

In [None]:
type(r.text)

In [18]:
weather_json = requests.get("https://api.weather.gov/points/38.8048,-77.0469").json()
forecast_url = weather_json["properties"]["forecast"]
forecast_json = requests.get(forecast_url).json()


In [25]:
for period in forecast_json["properties"]["periods"]:
    print(period["name"] + ":",
          period["shortForecast"], "and",
          str(period["temperature"]) + 
          period["temperatureUnit"])

Tonight: Mostly Clear and 38F
Wednesday: Mostly Sunny and 47F
Wednesday Night: Partly Cloudy and 30F
Thursday: Sunny and 43F
Thursday Night: Mostly Clear and 28F
Friday: Sunny and 50F
Friday Night: Mostly Clear and 35F
Saturday: Slight Chance Light Rain and 56F
Saturday Night: Chance Light Rain and 47F
Sunday: Light Rain Likely and 57F
Sunday Night: Chance Rain And Snow Showers and 37F
Monday: Chance Rain And Snow Showers and 47F
Monday Night: Slight Chance Rain Showers then Partly Cloudy and 27F
Tuesday: Sunny and 42F
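
The `.json()` call parses the body of the response into ordinary Python dictionaries and lists, much like calling `json.loads()` on the response text would. The sketch below uses a small hand-written string in place of a live response so that it runs without a network connection. Only the key names mirror the code above, and the two entries are copied from the sample output rather than fetched.

```python
import json

# Hand-written stand-in for the body of a forecast response
sample_body = """
{"properties": {"periods": [
    {"name": "Tonight", "shortForecast": "Mostly Clear",
     "temperature": 38, "temperatureUnit": "F"},
    {"name": "Wednesday", "shortForecast": "Mostly Sunny",
     "temperature": 47, "temperatureUnit": "F"}
]}}
"""

sample_json = json.loads(sample_body)   # nested dictionaries and lists

lines = []
for period in sample_json["properties"]["periods"]:
    lines.append("{}: {} and {}{}".format(
        period["name"], period["shortForecast"],
        period["temperature"], period["temperatureUnit"]))

print("\n".join(lines))
```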


<a id="Exercise3"></a>

# Exercise 3

Use string functions to print ONLY the text between the paragraph tags in the HTML we downloaded above (r.text).

To find the starting index of a character sequence in a string, use the .find() function. 

    test_str = "The quick brown fox jumped over the lazy dog"
    fox_idx = test_str.find("fox")  # now fox_idx will equal 16
    
    print(test_str[fox_idx:]) ## Will print 'fox jumped over the lazy dog'
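
A few more details about `.find()` may help here, shown on the same sentence rather than on HTML. It accepts an optional second argument giving the index to start searching from, and it returns -1 when the sequence is not found. A slice can also take an end index as well as a start index.

```python
test_str = "The quick brown fox jumped over the lazy dog"

start_idx = test_str.find("quick")      # 4
end_idx = test_str.find("fox")          # 16
between = test_str[start_idx:end_idx]   # everything from 'quick' up to (not including) 'fox'
print(repr(between))

first_o = test_str.find("o")               # 12, the 'o' in 'brown'
next_o = test_str.find("o", first_o + 1)   # 17, the 'o' in 'fox'
print(first_o, next_o)

missing = test_str.find("cat")          # -1 means "not found"
print(missing)
```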