## What You'll Accomplish in this Notebook

In this notebook you will:
<ul>
    <li>work through a basic pandas refresher</li>
    <li>learn about other file formats like tab-delimited and JSON files</li>
    <li>see how to read raw CSV files from the web</li>
    <li>review how to write data to file using pandas and base python</li>
</ul>

# Pandas and Data File Basics

In this first notebook we will discuss some basic data file types you'll encounter throughout the course and in your data science career. We'll also talk about one of the most popular data handling python packages, `pandas`.

Let's go!

In [None]:
# We'll first import pandas
# it is standard to import it as pd
import pandas as pd

## Reading in Data

### Common Delimited Files

#### CSVs

A csv file is a plain text file in which data values are separated by commas and rows are separated by line breaks (newline characters).
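
As a small illustration of that layout, here is a sketch that builds a tiny csv as a python string and hands it to `pandas`. The column names and values are made up for this example and are not taken from the iris file; `io.StringIO` is only used so the string can stand in for a file on disk.

```python
import io

import pandas as pd

# A made-up csv: one header row followed by two data rows.
# Commas separate the values and each "\n" ends a row.
csv_text = "length,width,label\n5.1,3.5,a\n4.9,3.0,b\n"

# io.StringIO lets pandas treat the string like an open file
toy = pd.read_csv(io.StringIO(csv_text))

print(toy)
print(toy.shape)  # (2, 3): two data rows, three columns
```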

To see what I mean, open <a href="Data/iris.csv">iris.csv</a> in the Data folder.

Now we'll see how to load this in using `pandas`.

In [None]:
# iris holds the pandas dataframe (df) object
# Using the default settings pandas reads in the first row as column headers
# All subsequent rows are read in as entries in the df
iris = pd.read_csv("iris.csv")

In [None]:
# df.head shows the first 5 rows
iris.head()

In [None]:
# df.tail shows the last 5 rows
iris.tail()

In [None]:
# What happens if you put a whole number in
# the parentheses of df.tail or df.head?
# Try that here
# If you already know the answer, great!







In [None]:
# What does df.sample() do?







`pandas` dataframes come with a number of useful features that we'll use throughout the course. Let's use the iris dataset to examine a few.

In [None]:
# df.describe()
iris.describe()

In [None]:
# df.column_name
# This produces a pandas series object, think of this like a vector
print(type(iris.petal_width))
print()
print(iris.petal_width)

In [None]:
# series.value_counts()
# This gives a count of the various values
iris.iris_class.value_counts()

### Practice

In [None]:
# See what df.mean() does





In [None]:
# What about df.max()?





In [None]:
# Now try the following
# df.groupby()
# Group the iris data by iris_class, then try to find
# the mean petal_width by class







#### Tab Delimited Files

This was a good start. Delimited files can be separated by things other than commas, for example tabs. Let's see an example of that.
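
Before loading a real file, here is a minimal sketch of the idea on made-up data (the gene names and numbers below are invented for illustration). It suggests that `pd.read_table`, which uses a tab as its default separator, gives the same result as `pd.read_csv` with `sep="\t"`.

```python
import io

import pandas as pd

# A made-up tab delimited table: "\t" separates values, "\n" ends a row
tsv_text = "gene\tbaseline\ttreated\ngeneA\t1.5\t2.5\ngeneB\t0.5\t0.25\n"

# read_table splits on tabs by default
from_table = pd.read_table(io.StringIO(tsv_text))

# read_csv can do the same job if we tell it the separator
from_csv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(from_table)
print(from_table.equals(from_csv))
```

The `sep` argument is also how you would handle other delimiters, such as semicolons.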

In [None]:
# read_table is used for tab delimited files
fly = pd.read_table("FlyRNAi_data_baseline_vs_EGF.txt")

In [None]:
fly.head()

### JSON Files

JSON files are another popular way to store data. JSON stands for JavaScript Object Notation and is a standard format for data passed over HTTP between web browsers and other applications, such as web servers. This format can be more complex and free-form than the prior two, but it is very similar to some of python's base data structures.
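
As a quick aside on that last point, the sketch below parses a small, made-up JSON string with `json.loads` (the keys and values are invented for illustration). JSON objects come back as python dictionaries, arrays as lists, `true`/`false` as `True`/`False`, and `null` as `None`.

```python
import json

# A made-up JSON document stored in a python string
json_text = '{"course": "data science", "weeks": [1, 2, 3], "online": true, "instructor": null}'

# json.loads parses a string; json.load (no s) parses an open file
parsed = json.loads(json_text)

print(type(parsed))           # <class 'dict'>
print(type(parsed["weeks"]))  # <class 'list'>
print(parsed["online"])       # True
print(parsed["instructor"])   # None
```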

Let's start with an example.

In [None]:
# This imports the json package
# a package that is available in base python
import json 

# This opens the json file in read mode, and stores it in file
file = open("miserables.json","r")

# This stores the file as a python dictionary
mis = json.load(file)

# This closes the file
file.close()

In [None]:
print(type(mis))

In [None]:
mis

As we can see, this particular data would be difficult to read in as a table the way it is stored. However, we can use `pandas` to create a dataframe from this data all the same.

In [None]:
# Here's a check, print the value corresponding to the
# nodes key for the mis dictionary





Okay now we are going to make a `pandas` dataframe from this dictionary.

In [None]:
# Write a script to extract the 'name' feature as its own list
# Call the list names

names = 




In [None]:
# Write a script to extract the 'group' feature as its own list
# Call the list groups

groups = 




In [None]:
# Now run this to create a miserables character dataframe
mis_df = pd.DataFrame({'name':names,'group':groups})

mis_df.head(10)

If your json file is arranged like a table you can just use `pandas` to read it in. Open the file <a href = "Data/json_table.json">json_table.json</a> to see what we're reading in.
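
To give a sense of what "arranged like a table" can mean, here is a minimal sketch that assumes one common layout: a list of records, each with the same keys. The data is made up, and the layout of `json_table.json` may differ from it.

```python
import io

import pandas as pd

# A made-up, table-like JSON string: a list of records with matching keys
table_json = '[{"city": "Columbus", "visits": 3}, {"city": "Dayton", "visits": 5}]'

# orient="records" tells pandas that each list entry is one row
toy_json = pd.read_json(io.StringIO(table_json), orient="records")

print(toy_json)
```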

In [None]:
df = pd.read_json("json_table.json")

In [None]:
df.head()

### Practice

Notice that `mis['nodes']` was in a tabular format. Write some code in the block below that makes a `pandas` dataframe without having to make lists. 

In [None]:
# Do your work here







### Reading Files Directly From a Website

Many popular websites host their data online as a raw csv or json file. For particularly large data, reading the file directly from its URL can be preferable to downloading it onto your personal machine.

For example, <a href="https://fivethirtyeight.com/">https://fivethirtyeight.com</a> posts all of their data on their GitHub profile, <a href="https://github.com/fivethirtyeight">https://github.com/fivethirtyeight</a>. Let's use one of their data sets as an example.

We'll look at their candy data from the ultimate candy power ranking story, <a href="https://fivethirtyeight.com/videos/the-ultimate-halloween-candy-power-ranking/">https://fivethirtyeight.com/videos/the-ultimate-halloween-candy-power-ranking/</a>. Here's the link from their GitHub, <a href="https://github.com/fivethirtyeight/data/tree/master/candy-power-ranking">https://github.com/fivethirtyeight/data/tree/master/candy-power-ranking</a>.

In [None]:
# Here is the raw csv file link
# from their github
url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv"

# we can read it in using pd.read_csv
candy = pd.read_csv(url)

In [None]:
# A sampling of candy
candy.sample(10)

This process of grabbing data off the web may come in handy when you work on your projects!

## Writing Data to File

We'll end the notebook with two ways we can write data to a file.

### to_csv with pandas

One easy way is to use the `to_csv` method in `pandas`.

In [None]:
# first make a dataframe
inputs = [1,2,3,4,5,6,7,8,9,10]
outputs = [2*i+11 for i in inputs]
df = pd.DataFrame({'input':inputs,'output':outputs})
df.head()

In [None]:
# Now try to_csv
df.to_csv("test.csv")

#### Practice

In [None]:
# Read in test.csv using pandas, then examine the first 5 entries







What happened?

We can address that by including the input `index = False` inside `to_csv`.
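
A minimal sketch of the difference, using a made-up dataframe rather than `test.csv`: when `to_csv` is called without a file name it returns the csv text as a string, which makes it easy to compare the two versions side by side.

```python
import io

import pandas as pd

# A small made-up dataframe
toy = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# With no file name, to_csv returns the csv text instead of writing a file
with_index = toy.to_csv()
without_index = toy.to_csv(index=False)

# Reading the first version back picks up the saved index as an extra,
# unnamed column; the second version has only the original columns
cols_with = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index)).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'x', 'y']
print(cols_without)  # ['x', 'y']
```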

In [None]:
# Resave your csv without saving the index







Depending on how you use your data and how much of it you store, you may want to save it in a format other than csv. If that is the case, I encourage you to look at the `pandas` documentation here, <a href="https://pandas.pydata.org/pandas-docs/stable/">pandas docs</a>. Alternatively, you can search the web for what you'd like to do and probably find an answer more quickly that way.
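
As one hedged example of another format, the sketch below writes a made-up dataframe to JSON text with `to_json` and reads it back. The choice of `orient="records"` is just one option picked for illustration; the docs describe the others.

```python
import io
import json

import pandas as pd

# A small made-up dataframe
toy = pd.DataFrame({"input": [1, 2, 3], "output": [13, 15, 17]})

# With no file name, to_json returns the JSON text as a string.
# orient="records" stores one JSON object per row.
json_text = toy.to_json(orient="records")
print(json_text)

# Read the text back in using the same orient
restored = pd.read_json(io.StringIO(json_text), orient="records")
print(restored)
```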

### Writing to File Manually Using python

Just as python allows you to read a file, it also allows you to write to a file. Let's see an example.

In [None]:
# open will open a file object
# the w+ mode opens the file for writing (and reading),
# creating it if it doesn't exist and erasing
# its contents if it does
file = open("write_to_file.csv","w+")

# This will write some columns onto our file
# The \n tells python you want a new line
file.write("name,group\n")

In [None]:
# Now let's return to the les mis example
# You do some coding now
# For each character write their name and group to file








In [None]:
# Run this when you're done to close your file
file.close()

That's it!

There are additional file writing methods that are useful. Check out section 7.2 of the python docs here, <a href="https://docs.python.org/3/tutorial/inputoutput.html">https://docs.python.org/3/tutorial/inputoutput.html</a>, for more information.
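
One pattern covered in that tutorial is the `with` statement, which closes the file for you when the indented block ends, so there is no separate `file.close()` to remember. Here is a minimal sketch with made-up rows; the file name `with_example.csv` and the temporary folder are assumptions for this example only.

```python
import os
import tempfile

# Made-up rows to write out
rows = [("apple", 1), ("banana", 2)]

# Write into a temporary folder so nothing in your working folder changes
path = os.path.join(tempfile.mkdtemp(), "with_example.csv")

# The file is closed automatically when the with block ends
with open(path, "w") as f:
    f.write("item,code\n")
    for item, code in rows:
        f.write(f"{item},{code}\n")

print(f.closed)  # True

# Read the file back to check what was written
with open(path, "r") as f:
    contents = f.read()

print(contents)
```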