## Homework 01 - Instructions

- The goal of this homework is to collect some data from you and give you experience working with GitHub and Jupyter Notebooks. We will also learn the basics of loading data into Python.
- We will go over importing data from a number of different file types.



### Importing a CSV File
- Reading a CSV file is a very common task that we will perform over and over. Luckily, Python ships with a `csv` package that does the majority of the heavy lifting.
- `import csv` makes the functions in the `csv` package available under the `csv.` prefix.
- We are going to import the values from the CSV into a specific type of Python data structure, a `list`. Declaring `listcsv=[]` initializes an empty list, which provides the `append` method.
- By using the `with open` syntax shown below, we don't have to explicitly close the file; it is closed automatically when the block ends.

## CSV
- Comma delimited files are a common way of transmitting data. 
- Data for different columns is separated by a comma.
- Open the CSV file in a 

### Importing a CSV File
- The `open` command specifies the file name and what we want to do with it; here `r` stands for read.
- The `csvreader` is an object that iterates over the rows of the file.
- We then read each `row` using a `for` loop. 

In [1]:
#This is an example of how to import a CSV into a Python list. 
import csv  #This imports the CSV package.
listcsv=[] #This initializes a list data structure.

with open('in/name.csv', 'r', newline='') as data_file:   #The "with" incorporates an open and close of file; newline='' is what the csv package expects.
    csvreader = csv.reader(data_file, delimiter=',')
    for row in csvreader:
        print("each row of the reader imported as a:", type(row), row,"\n")
        listcsv.append(row)


each row of the reader imported as a: <class 'list'> ['\ufefffirst-name', 'last-name', 'email-rpi', 'email-other', 'github-userid', 'slack-userid'] 

each row of the reader imported as a: <class 'list'> ['Jason', 'Kuruzovich', 'kuruzj@rpi.edu', 'jkuruzovich@gmail.com', 'jkuruzovich', 'jason_kuruzovich'] 
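
The `\ufeff` in front of `first-name` in the output above is a byte order mark (BOM) that some tools write at the start of a UTF-8 file. The following is a minimal sketch, assuming a made-up file and a hypothetical helper named `read_header`, of how the `utf-8-sig` encoding strips the mark while reading:

```python
import csv
import os
import tempfile

# Build a small CSV that starts with a UTF-8 byte order mark (BOM).
# The file name and contents are made up for illustration.
path = os.path.join(tempfile.mkdtemp(), "bom_example.csv")
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbffirst-name,last-name\nAda,Lovelace\n")

def read_header(encoding):
    """Return the first row of the CSV when decoded with `encoding`."""
    with open(path, "r", encoding=encoding, newline="") as data_file:
        return next(csv.reader(data_file))

header_plain = read_header("utf-8")      # BOM stays attached to the first field
header_sig = read_header("utf-8-sig")    # BOM is stripped while decoding

print(header_plain)
print(header_sig)
```

If the header names are used later as dictionary keys or column labels, reading with `utf-8-sig` avoids a first key that silently differs from `first-name`.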



### Reassigning variables programmatically
- `listcsv` is a 2-dimensional list (a list of lists): the first index selects the row and the second index selects the column.
- Indices start at 0, so in this case the header row is row 0.
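
As a small illustration of the row/column indexing described above, here is a sketch using a made-up two-row table (the name `table` and its values are assumptions, not part of the homework data):

```python
# A made-up table shaped like listcsv: row 0 is the header, row 1 is the data.
table = [
    ["first-name", "last-name"],
    ["Ada", "Lovelace"],
]

header_row = table[0]        # the whole header row
first_name = table[1][0]     # row 1, column 0
table[1][1] = "Byron"        # reassign row 1, column 1 in place

print(header_row)
print(first_name, table[1][1])
```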


In [2]:
#q1. Update ALL values of the list (with your information) (1 pt).
#Here is how you update your list.  Update ALL values.
listcsv[1][0] = "Yirong"   #row/column numbers start at 0
listcsv[1][1] = "Tang"    #row/column numbers start at 0
listcsv[1][2] = "tangy8@rpi.edu"
listcsv[1][3] = "tyrvivian@hotmail.com"
listcsv[1][4] = "tyrvivian"
listcsv[1][5] = "yirong tang"
#...(update the rest of the variables)


### Output the CSV File
- Notice that writing works just like reading, except that we open the file with `w` (write) instead of `r`.

In [3]:
#Here we are going to save as a comma delimited file
with open("out/name.csv", 'w', newline='') as outfile:
    writer = csv.writer(outfile, delimiter=',')
    writer.writerows(listcsv)

## Writing a Tab Delimited file
- Here we output the same data as a tab delimited file.
- A tab delimited file uses '\t' as the delimiter.
- Update the file below. 

In [7]:
#Here we are going to save as a tab delimited file
with open("out/name.txt", 'w', newline='') as outfile:
    writer = csv.writer(outfile, delimiter='\t')
    writer.writerows(listcsv)
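
To check that a tab delimited file reads back correctly, the same `delimiter` has to be passed to the reader. The sketch below assumes made-up data and uses an in-memory buffer (`io.StringIO`) instead of a file on disk, purely to keep the example self-contained:

```python
import csv
import io

rows = [["first-name", "last-name"], ["Ada", "Lovelace"]]  # made-up data

# Write tab-delimited text into an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t").writerows(rows)
tab_text = buffer.getvalue()

# Read it back with the same delimiter to recover the original rows.
round_trip = list(csv.reader(io.StringIO(tab_text), delimiter="\t"))

print(repr(tab_text))
print(round_trip)
```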

### Importing CSV into a Pandas Dataframe
- Data structured like CSVs is extremely common.
- We are going to use a package called Pandas, which gives access to many useful methods for working with data.
- `pandas` is often imported as the abbreviated `pd`.
- Typing the object name of a pandas dataframe (here `dfcsv`) gives a *pretty printed* version of the table.

In [8]:
# This will load the local name.csv file into a Pandas dataframe.  We will work with these a lot in the future.
import pandas as pd # This line imports the pandas package. 
dfcsv = pd.read_csv('in/name.csv')
dfcsv

| | first-name | last-name | email-rpi | email-other | github-userid | slack-userid |
|---|---|---|---|---|---|---|
| 0 | Jason | Kuruzovich | kuruzj@rpi.edu | jkuruzovich@gmail.com | jkuruzovich | jason_kuruzovich |


### Notice the Pandas Magic! 
- Pandas figured out that the first row holds the column names and gave each data row an index.
- We can update values in the dataframe via `loc` or `iloc`.

In [9]:
# Here we have 2 ways of updating the dataframe. The first uses loc, which selects by row and column label.
dfcsv.loc[0, 'first-name'] = 'yirong'
dfcsv.loc[0, 'last-name'] = 'tang'
dfcsv.loc[0, 'email-rpi'] = 'tangy8@rpi.edu'
dfcsv.loc[0, 'email-other'] = 'tyrvivian@hotmail.com'
dfcsv.loc[0, 'github-userid'] = 'tyrvivian'
dfcsv.loc[0, 'slack-userid'] = 'yirong tang'
dfcsv

| | first-name | last-name | email-rpi | email-other | github-userid | slack-userid |
|---|---|---|---|---|---|---|
| 0 | yirong | tang | tangy8@rpi.edu | tyrvivian@hotmail.com | tyrvivian | yirong tang |


In [None]:
# This just utilizes the integer position. 
dfcsv.iloc[0, 0] = 'your first via iloc'
dfcsv.iloc[0, 1] = 'your last via iloc'
dfcsv
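
The difference between the two accessors is easier to see when the row labels are not integers. This is a minimal sketch on a made-up dataframe (the labels `"a"` and `"b"` and all values are assumptions for illustration):

```python
import pandas as pd

# A made-up frame with string row labels so labels and positions visibly differ.
df = pd.DataFrame(
    {"first-name": ["Ada", "Grace"], "last-name": ["Lovelace", "Hopper"]},
    index=["a", "b"],
)

by_label = df.loc["b", "last-name"]   # row label "b", column label "last-name"
by_position = df.iloc[1, 1]           # second row, second column

df.loc["a", "first-name"] = "Augusta"  # assignment by label
df.iloc[1, 0] = "G."                   # assignment by position

print(by_label, by_position)
print(df)
```

In the homework dataframe the row label happens to be the integer `0`, which is why `loc[0, ...]` and `iloc[0, ...]` appear to address the row the same way.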

In [10]:
#Update the remainder of the information using the loc method of Pandas. 
# Notice how we can just as easily write the file. 
# to_csv returns None when given a path, so we do not assign the result back to dfcsv.
dfcsv.to_csv('out/namepd.csv')
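
By default `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. A short sketch with made-up data, writing to a string rather than a file, shows the effect of `index=False`:

```python
import io
import pandas as pd

df = pd.DataFrame({"first-name": ["Ada"], "last-name": ["Lovelace"]})  # made-up data

with_index = df.to_csv()                # no path given, so the CSV text is returned
without_index = df.to_csv(index=False)  # leave the row index out of the output

# Reading the first version back turns the saved index into an extra column.
reloaded = pd.read_csv(io.StringIO(with_index))
reloaded_clean = pd.read_csv(io.StringIO(without_index))

print(list(reloaded.columns))
print(list(reloaded_clean.columns))
```

Whether to keep the index depends on whether it carries meaning; for a default 0, 1, 2, ... index it is usually safe to drop it.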

## JSON
- JSON stands for JavaScript Object Notation.
- It allows multiple layers of nesting, something that could take multiple files in CSV or multiple tables in a relational database.
- JSON is often used for APIs.
- Our JSON is imported as a `dict` (dictionary), which is another built-in Python data structure.


In [11]:
import json   #This imports the json package.
from pprint import pprint  #This will print the file in a nested way. 

with open('in/name.json') as data_file:   #The "with" incorporates an open and close of file.   
    datajson = json.load(data_file)

print("data is a python object of type: ", type(datajson),"\n")
pprint(datajson) #Pretty printing (pprint) makes it easier to see the nesting of the files. 


data is a python object of type:  <class 'dict'> 

{'student': [{'email-other': 'jkuruzovich@gmail.com',
              'email-rpi': 'kuruzj@rpi.edu',
              'first-name': 'Jason',
              'github-userid': 'jkuruzovich',
              'last-name': 'Kuruzovich',
              'slack-userid': 'jason_kuruzovich'}]}


In [12]:
#Here is how you update the dictionary: 
#We are indicating that we want the first student, and from there we list which "key" for 
#the dictionary we want (i.e., 'first-name').
# Update the rest of the information with your information.

datajson['student'][0]['first-name']  = 'Yirong'
datajson['student'][0]['last-name']  = 'Tang'
datajson['student'][0]['email-other']  = 'tyrvivian@hotmail.com'
datajson['student'][0]['email-rpi']  = 'tangy8@rpi.edu'
datajson['student'][0]['github-userid']  = 'tyrvivian'
datajson['student'][0]['slack-userid']  = 'yirong tang'
pprint(datajson)


{'student': [{'email-other': 'tyrvivian@hotmail.com',
              'email-rpi': 'tangy8@rpi.edu',
              'first-name': 'Yirong',
              'github-userid': 'tyrvivian',
              'last-name': 'Tang',
              'slack-userid': 'yirong tang'}]}


In [13]:
import json   #This imports the json package.
from pprint import pprint  #This will print the file in a nested way. 

with open('out/name.json', 'w') as data_file:   #The "with" incorporates an open and close of file.   
    json.dump(datajson, data_file)
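
`json.dump` writes everything on a single line by default. If the output file should be easy to read, the `indent` argument can help; here is a small sketch with a made-up record, using `json.dumps` (the string-returning variant) so nothing is written to disk:

```python
import json

record = {"student": [{"first-name": "Ada", "last-name": "Lovelace"}]}  # made-up data

compact = json.dumps(record)            # everything on one line
pretty = json.dumps(record, indent=2)   # nested levels shown by indentation

# Both strings parse back to the same dictionary.
same_data = json.loads(compact) == json.loads(pretty) == record

print(pretty)
print(same_data)
```

The same `indent=2` argument can be passed to `json.dump` when writing to a file.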

## Parquet Files
- CSV files are great for humans to read and understand.  
- For "big data", though, CSV isn't a great long-term storage option (inefficient/slow).
- Parquet is a columnar storage format. It makes dealing with lots of columns fast.
- [fastparquet](https://fastparquet.readthedocs.io) is a Python package for dealing with Parquet files. 
- Apache Spark also natively reads Parquet Files. 
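
To give a feel for what "columnar" means, the sketch below contrasts a row-oriented layout with a column-oriented one using plain Python structures. It is only a conceptual illustration with made-up data, not how Parquet actually encodes files:

```python
# Row-oriented layout (like CSV): each record is stored together.
rows = [
    {"name": "Ada", "age": 36},
    {"name": "Grace", "age": 85},
    {"name": "Alan", "age": 41},
]

# Column-oriented layout (the idea behind Parquet): each column is stored together.
columns = {
    "name": [r["name"] for r in rows],
    "age": [r["age"] for r in rows],
}

# Reading one column touches every record in the row layout...
ages_from_rows = [r["age"] for r in rows]
# ...but is a single lookup in the column layout.
ages_from_columns = columns["age"]

print(ages_from_rows == ages_from_columns)
```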

In [14]:
from fastparquet import ParquetFile
pf = ParquetFile('in/name.parq')
dfparq = pf.to_pandas()
dfparq

| | first-name | last-name | email-rpi | email-other | github-userid | slack-userid |
|---|---|---|---|---|---|---|
| 0 | Jason | Kuruzovich | kuruzj@rpi.edu | jkuruzovich@gmail.com | jkuruzovich | jason_kuruzovich |


In [15]:
#Update the dfparq dataframe (same code as Pandas)
dfparq.loc[0, 'first-name'] = 'Yirong'
dfparq.loc[0, 'last-name'] = 'Tang'
dfparq.loc[0, 'email-other'] = 'tyrvivian@hotmail.com'
dfparq.loc[0, 'email-rpi'] = 'tangy8@rpi.edu'
dfparq.loc[0, 'github-userid'] = 'tyrvivian'
dfparq.loc[0, 'slack-userid'] = 'yirong tang'


In [16]:
# We can similarly easily write to a .parq file. 
from fastparquet import write
write('out/name.parq', dfparq)

## Homework Rubric
The following rubric will be used to grade homework.<br>
q1. Updated `out/name.csv` file with your information (2 pt).<br>
q2. Updated `out/name.txt` file with your information (2 pt).<br>
q3. Updated `out/namepd.csv` file with your information (2 pt).<br>
q4. Updated `out/name.json` file with your information (2 pt).<br>
q5. Updated `out/name.parq` parquet file (2 pt).<br>
q6. Review the [pandas documentation](https://pandas.pydata.org) and describe 3 cool things that Pandas can do. (2 pt)<br>
q7. Describe the difference between `loc` and `iloc` when accessing a Pandas dataframe.<br>
q8. Let's say you had this sample data in JSON. Show (conceptually) how you would go about changing this to a CSV file. Your output should just list the structure of the data file.
```{json}
myObj = {
    "name":"John",
    "age":30,
    "cars": {
        "car1":"Ford",
        "car2":"BMW",
        "car3":"Fiat"
    }
}
```
q9. Read through the [fastparquet documentation](https://fastparquet.readthedocs.io).<br>
q10. Agreement with statement: "I worked through this entire homework step by step."

In [17]:
#Notice how we can use 3 quotes. 
q6 = """q6.  
1. .groupby() can reference either column names or index level names,
so columns and index levels can be grouped together, which is much easier.

2. Pandas discards the Panel and uses a 2-level MultiIndexed DataFrame when using
.rolling(..), .expanding(..), or .ewm(..), so the new tables are easier to read.

3. Pandas created a new type of numerical index, UInt64Index, which supports more
operations on unsigned, or purely non-negative, integers that were not supported before.
   

"""

In [18]:
q7 = """q7.
'loc' works on labels in the index.
'iloc' works on the positions in the index (so it only takes integers).




"""

In [19]:
q8 = """q8.
import csv
import json 


myObj=[
    {
    "name":"John",
    "age":30,
    "cars": {
        "car1":"Ford",
        "car2":"BMW",
        "car3":"Fiat"
        }
    }
    ]
    
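myObj_data = json.loads(json.dumps(myObj))  # round-trip through a JSON string

# Flatten the nested "cars" object into one column per car.
with open("test_str.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["name", "age", "car1", "car2", "car3"])  # header row
    for person in myObj_data:
        writer.writerow([person["name"],
                         person["age"],
                         person["cars"]["car1"],
                         person["cars"]["car2"],
                         person["cars"]["car3"]])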

"""

In [20]:
q9 = """q9.
The reading is finished

"""

In [21]:
#q10 I worked through this entire homework step by step. Change to True if you did. 

q10= "True"

In [22]:
answers= [q6,q7,q8,q9,q10]
with open('out/answers.txt', 'w') as outfile:   #The "with" incorporates an open and close of file. 
    outfile.write("\n".join(answers))




This work is licensed under the [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/) license agreement.