# Loading and Describing Data Using Pandas

In this notebook, we will learn how to read data from files (CSV, tab-delimited, etc.). We will look at some of the common problems that arise and their solutions. We will also learn how to do some basic data exploration and description.

## Notebook Outline:

* <a href='#IntroToPandas'>Introduction To Pandas</a>
* <a href='#LoadingStandardCSV'>Loading Standard CSV</a>
* <a href='#BasicDataDescription'>Basic Data Description</a>
* <a href='#LoadingTabDataFile'>Loading Tab Delimited Files</a>
* <a href='#LoadingWhiteSpaceFile'>Loading White Space Delimited Files</a>
* <a href='#WritingOutToCSV'>Writing Data Out To A CSV</a>
* <a href='#LessonSummary'>Lesson Summary</a>

# How to use this Notebook

The best way to use this notebook is to follow along with the lecture and then apply what you learn to your own data files or, if you do not have any data of your own, to practice using these functions and methods on the provided data. A little practice goes a long way towards understanding and retaining the material! It would be easy to just skim this notebook, but you will learn more by doing!

<a name='IntroToPandas'></a>
# Introduction To Pandas - What is Pandas?
### From the Pandas website:
http://pandas.pydata.org/

'pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.'

If you are familiar with Excel, think of Pandas as a similar tool to explore and analyze data. There are big differences between Pandas and Excel (Pandas is faster, can handle larger datasets more efficiently, and can do more overall, but does not have a GUI), but they can be used for similar purposes, and having that comparison in mind may help you digest the information.

### My experience with Pandas:

I use Pandas every day, along with Jupyter Notebook, to explore and analyze client data. It is an integral part of my real-world workflow.

<a name='LoadingStandardCSV'></a>
# Loading A Standard CSV File With Pandas
In the cells below, we import pandas. Then, we load a data file of *real data* for a group of 7 popular fast food stores. We will walk through the data in the lecture. Please see the comments in each cell for more details about the code.

Also, we will be using the `read_csv()` method extensively and introducing some of its arguments. If you'd like, you can refer to its documentation here: <https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html>

In [1]:
# First we must import pandas.  It is very common to import pandas as pd.  All
# this means is that I can refer to pandas as 'pd' in my code - saving myself
# from typing 4 more characters and also saving space.

import pandas as pd
import os

In [2]:
# Next, we need to define the path to our file. We use the os library to
# build the path dynamically, so it should load on every student's computer.
# os.getcwd() returns the current working directory, which is normally the
# directory that contains the notebook. os.path.join joins the elements of
# the path with the correct separator ('\' or '/'), depending on whether you
# are running on Windows or Mac/Linux.


filepath = os.path.join(os.getcwd(), 'data', 'ShiftManagerApp_LaborSheet.csv')
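
If the file does not load, the usual cause is that the working directory is not the notebook's directory. Below is a minimal sketch of a check you could run before calling `read_csv()`. The `data_dir` variable is introduced here only for illustration; the file name is the one used in this lesson.

```python
import os

# Build the path piece by piece; os.path.join inserts the right separator
# for the current operating system.
data_dir = os.path.join(os.getcwd(), 'data')
filepath = os.path.join(data_dir, 'ShiftManagerApp_LaborSheet.csv')

# Check before loading, so a wrong working directory gives a short, clear
# message instead of a long traceback from read_csv().
if os.path.exists(filepath):
    print('Found:', filepath)
else:
    print('Not found:', filepath)
    print('Current working directory is:', os.getcwd())
```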

In [3]:
# Now we can load our data.  It is pretty simple, we just use the read_csv()
# method. Method docs can be found here:
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

print(filepath)
store_data = pd.read_csv(filepath)

/Users/williamhenry/Documents/1_Projects/E-Trade/Course Files/Python For Data Analysis Course Files/data/ShiftManagerApp_LaborSheet.csv


In [4]:
# Let's print the type of the object we just created
print(type(store_data))

<class 'pandas.core.frame.DataFrame'>


<a name='BasicDataDescription'></a>
# Basic Data Description

Now we start our basic data description!  Unfortunately, "basic" can sometimes sound like something is not interesting, or not "the good stuff" - this is not true in this case. It is relatively simple, but it is very important to have a solid high-level understanding of your data before you dive in deeper. If you skip this, you will end up paying for it later.

## Methods To Help Get A Quick Look At The Data:
* `head()` - print the first `n` lines
* `tail()` - print the last `n` lines
* `sample()` - print a random `n` lines

#### The .head() method can be used to get the first n lines of a dataframe. It is always a good idea to just 'look' at your data.

In [5]:
# Below we print the first 3 lines of the data file. The default number of
# lines printed is 5
store_data.head(3)

Unnamed: 0,Store_ID,Manager,Date,Ending_Hour,Projected_Sales,Sales,DT_TTL,Car_Count,KVS_Total,Scheduled_People,Actual_People,Reason_for_Labor_Diff,Reason_for_High_TTLs,Manager_Entering_Data,Timestamp,OEPE,Park_Percentage
0,4462,JillianA,2017-01-23,08:00:00,540.0,420.0,170.0,,100.0,,,,,,2017-01-23 09:52:14,,
1,4462,ZoeyD,2017-02-05,06:00:00,90.0,155.0,114.0,,78.0,,,,,,2017-02-05 11:30:48,,
2,4462,JessicaB,2017-02-05,07:00:00,173.0,182.0,106.0,,81.0,,,,,,2017-02-05 11:35:48,,


#### The .tail() method can be used to get the last n lines of a dataframe.

In [6]:
store_data.tail()

Unnamed: 0,Store_ID,Manager,Date,Ending_Hour,Projected_Sales,Sales,DT_TTL,Car_Count,KVS_Total,Scheduled_People,Actual_People,Reason_for_Labor_Diff,Reason_for_High_TTLs,Manager_Entering_Data,Timestamp,OEPE,Park_Percentage
25466,31225,ChristopherP,2018-07-28,19:00:00,173.0,,,,,5.0,,,,,2018-07-28 09:53:37,,
25467,31225,ChristopherP,2018-07-28,20:00:00,110.0,,,,,3.0,,,,,2018-07-28 09:53:57,,
25468,31225,ChristopherP,2018-07-28,21:00:00,89.0,,,,,3.0,,,,,2018-07-28 09:54:12,,
25469,31225,ChristopherP,2018-07-28,22:00:00,100.0,,,,,3.0,,,,,2018-07-28 09:54:29,,
25470,31225,ChristopherP,2018-07-28,23:00:00,102.0,,,,,3.0,,,,,2018-07-28 09:54:44,,


We just learned something by looking at the data - it looks like the names were entered in all caps in some of the data. This will be important later.

#### The .sample() method can be used to get a random sample of n rows from the dataframe.

In [7]:
store_data.sample(10)

Unnamed: 0,Store_ID,Manager,Date,Ending_Hour,Projected_Sales,Sales,DT_TTL,Car_Count,KVS_Total,Scheduled_People,Actual_People,Reason_for_Labor_Diff,Reason_for_High_TTLs,Manager_Entering_Data,Timestamp,OEPE,Park_Percentage
18740,18065,VeronicaC,2018-02-01,19:00:00,646.0,519.0,333.0,46.0,,12.5,11.5,,,Veronica C,2018-02-01 19:14:15,,
7024,4587,TommyA,2018-06-01,19:00:00,675.0,,,,,,,,,,2018-06-01 04:15:07,,
10174,11794,ErinS,2017-11-01,07:00:00,233.0,356.0,112.0,,31.0,,,,,,2017-11-01 11:00:02,,
12052,11794,JessicaM,2018-04-21,09:00:00,455.0,645.92,252.0,,205.0,8.5,8.5,,,Carrie stepp,2018-04-21 09:17:01,,
13570,11969,ArielA,2017-04-03,09:00:00,640.0,659.0,134.0,,85.0,,,,,,2017-04-03 09:03:53,,
22413,31225,KristinB,2017-12-27,20:00:00,147.0,177.0,175.0,,194.0,,,,,Kristin Bashe,2017-12-27 11:00:03,,
18968,18065,SabrinaD,2018-02-14,07:00:00,196.0,286.0,144.0,39.0,56.0,7.0,7.0,,,Sabrina D,2018-02-14 07:18:04,,
9666,11794,ChristinaS,2017-06-03,17:00:00,343.0,311.0,134.0,,107.0,,,,,,2017-06-03 17:23:10,,
19677,18065,OliviaP,2018-03-30,17:00:00,564.0,340.0,231.0,47.0,119.0,9.5,8.0,,,Olivia,2018-03-30 17:12:34,,
24150,31225,HaileyB,2018-05-06,10:00:00,146.0,136.0,131.0,34.0,86.0,3.5,3.5,,,,2018-05-06 11:09:44,,
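
To practice these three methods without the course data, here is a small sketch on a made-up dataframe (the `toy` frame and its values are invented for illustration). It also shows the `random_state` argument of `sample()`, which fixes the random seed so that the same rows come back on every run.

```python
import pandas as pd

# A small made-up dataframe standing in for store_data.
toy = pd.DataFrame({
    'Store_ID': [4462, 4462, 4587, 4587, 11794, 11794],
    'Sales': [420.0, 155.0, 182.0, 300.0, 356.0, 645.92],
})

first_two = toy.head(2)   # the first 2 rows
last_two = toy.tail(2)    # the last 2 rows

# With the same random_state, sample() returns the same rows each time.
sample_a = toy.sample(3, random_state=0)
sample_b = toy.sample(3, random_state=0)

print(first_two)
print(last_two)
print(sample_a)
```

Without `random_state`, each call to `sample()` will generally return different rows, which is why your output for the cell above will not match the lecture's.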


## Understanding the Shape and Variables Types of The Data
* `shape` - an attribute that tells us the number of rows and columns in a dataset
* `columns` - an attribute that returns the names of the columns (as an index)
* `info()` - a method that returns information about the dataframe (see below)
* `memory_usage()` - a method that returns the memory usage, in bytes, of the index and of each individual column

#### The .shape _attribute_ will tell us the size of the file; the number of rows and the number of columns.

In [9]:
store_data.shape

(25471, 17)

#### The .columns attribute will tell us the names of the columns.

In [10]:
store_data.columns

Index(['Store_ID', 'Manager', 'Date', 'Ending_Hour', 'Projected_Sales',
       'Sales', 'DT_TTL', 'Car_Count', 'KVS_Total', 'Scheduled_People',
       'Actual_People', 'Reason_for_Labor_Diff', 'Reason_for_High_TTLs',
       'Manager_Entering_Data', 'Timestamp', 'OEPE', 'Park_Percentage'],
      dtype='object')

#### The .info() method will tell us quite a bit about our dataframe: the datatypes, the number of non-null values, and the memory usage

* The 'object' data type is used for strings or other variable types that are not numbers or dates. Lists or tuples, for example, can be stored in a dataframe, but that is rare. Most of the time, when you see 'object', it means the column contains strings.
 
* A null entry would be one that is _empty_ in the dataset.  Remember that sometimes the dataset already comes with null or missing values marked with a special value, like -9999 (we will see this in the weather data example). Pandas will not immediately recognize this as a null value.
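
The point about special values can be seen in a small sketch. The data below is made up for illustration: one entry is truly empty and one uses the -9999 marker. Replacing the marker with `NaN` after loading is only one possible approach, shown here as an assumption.

```python
import numpy as np
import pandas as pd

# Made-up data: one truly empty entry (None) and one sentinel value (-9999).
toy = pd.DataFrame({'precip': [0.0, None, -9999.0, 5.0]})

# Only the truly empty entry is counted as null.
nulls_before = int(toy['precip'].isna().sum())

# After replacing the sentinel with NaN, both entries are counted as null.
cleaned = toy.replace(-9999, np.nan)
nulls_after = int(cleaned['precip'].isna().sum())

print(nulls_before, nulls_after)
```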

In [11]:
store_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25471 entries, 0 to 25470
Data columns (total 17 columns):
Store_ID                 25471 non-null int64
Manager                  25471 non-null object
Date                     25471 non-null object
Ending_Hour              25471 non-null object
Projected_Sales          25316 non-null float64
Sales                    23678 non-null float64
DT_TTL                   23647 non-null float64
Car_Count                14529 non-null float64
KVS_Total                23608 non-null float64
Scheduled_People         16175 non-null float64
Actual_People            16068 non-null float64
Reason_for_Labor_Diff    553 non-null object
Reason_for_High_TTLs     247 non-null object
Manager_Entering_Data    11057 non-null object
Timestamp                25471 non-null object
OEPE                     0 non-null float64
Park_Percentage          0 non-null float64
dtypes: float64(9), int64(1), object(7)
memory usage: 3.3+ MB


#### The .memory_usage() method gives the size of each column in bytes.
Note that if you add these together and divide by 1024 twice (1024 bytes = 1 KB, and 1024 KB = 1 MB), you get roughly 3.3 MB, the same number that is shown in the output from .info()

In [12]:
store_data.memory_usage()

Index                       128
Store_ID                 203768
Manager                  203768
Date                     203768
Ending_Hour              203768
Projected_Sales          203768
Sales                    203768
DT_TTL                   203768
Car_Count                203768
KVS_Total                203768
Scheduled_People         203768
Actual_People            203768
Reason_for_Labor_Diff    203768
Reason_for_High_TTLs     203768
Manager_Entering_Data    203768
Timestamp                203768
OEPE                     203768
Park_Percentage          203768
dtype: int64
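
The arithmetic behind these numbers can be checked by hand. This sketch uses only the values printed above: 25471 rows, 17 columns, and 128 bytes for the index. Each column stores one 8-byte value per row (for 'object' columns this is a reference to the string, not the string itself).

```python
# Numbers copied from the shape and memory_usage() outputs above.
n_rows = 25471
n_columns = 17
index_bytes = 128

# Each column holds one 8-byte value per row.
bytes_per_column = n_rows * 8

total_bytes = bytes_per_column * n_columns + index_bytes
total_kb = total_bytes / 1024          # bytes -> KB
total_mb = total_bytes / 1024 ** 2     # bytes -> MB

print(bytes_per_column)
print(total_bytes)
print(round(total_kb, 1))
print(round(total_mb, 1))
```

The result in MB matches the "3.3+ MB" reported by `.info()`. The "+" is there because the strings held in 'object' columns take additional memory that this count does not include.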

## Statistical Description Of The Data
* `describe()` - a method that returns basic statistics about all numerical columns
* `nunique()` - a method that returns the number of unique values in each column. It also can be used on a single column.
* `unique()` - a column method that returns the unique values of a column (must be used on a single column)
* `value_counts()` - a column method that returns the counts of each unique value in a column (must be used on a single column)

#### The .describe() method outputs basic descriptive statistics about all of the _numerical_ columns in the dataframe.

In [13]:
store_data.describe()

Unnamed: 0,Store_ID,Projected_Sales,Sales,DT_TTL,Car_Count,KVS_Total,Scheduled_People,Actual_People,OEPE,Park_Percentage
count,25471.0,25316.0,23678.0,23647.0,14529.0,23608.0,16175.0,16068.0,0.0,0.0
mean,13434.155549,433.190017,445.689741,246.710915,55.969096,135.200779,8.900309,8.280651,,
std,8666.521776,240.989063,2114.620848,212.535401,551.387031,94.207054,4.128854,3.784632,,
min,4462.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,
25%,4587.0,230.0,230.0,171.0,32.0,82.0,5.0,5.0,,
50%,11794.0,428.0,415.985,226.0,47.0,119.0,8.5,8.0,,
75%,18065.0,583.0,581.29,294.0,61.0,167.0,12.0,11.0,,
max,31225.0,4553.0,321145.0,24347.0,44196.0,6670.0,56.0,45.0,,


#### The .unique() method will output the unique values in a column.
In order to get a column from a dataframe, simply put the column name in square brackets after the dataframe variable. For example, we use store_data['Store_ID'] below to get the Store_ID column of the dataframe. (We will cover indexing and slicing of dataframes in greater detail in a following lesson.)

In [14]:
store_data['Store_ID'].unique()

array([ 4462,  4587, 10523, 11794, 11969, 18065, 31225])

#### The .nunique() method will output the number of unique values in a column

In [15]:
store_data['Manager'].nunique()

124

#### The .value_counts() method will output the number of times each value occurs in a column. 

In [16]:
print(store_data['Manager'].value_counts())

DianeA          828
JessicaA        733
CeaunnaS        691
ZoilaO          687
ChristopherP    685
               ... 
NicholeS          2
BrittanyS         2
Erin              2
ConnieG           1
DeannaG           1
Name: Manager, Length: 124, dtype: int64
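
Here is a small sketch that puts the three column methods side by side. The `managers` series is made up for illustration (the names are borrowed from the output above, the counts are not).

```python
import pandas as pd

# A made-up column of manager names.
managers = pd.Series(
    ['DianeA', 'JessicaA', 'DianeA', 'ZoilaO', 'DianeA'],
    name='Manager',
)

distinct_names = managers.unique()   # the distinct values, in order of appearance
n_distinct = managers.nunique()      # how many distinct values there are
counts = managers.value_counts()     # occurrences of each value, most frequent first

print(distinct_names)
print(n_distinct)
print(counts)
```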


## Brief summary of what you have learned so far:
* head(n) - get the first n rows
* tail(n) - get the last n rows
* sample(n) - get a random sample of n rows
* shape - get the number of rows and columns
* columns - get the column names
* dtypes - get the variable types of each column
* info() - get the variable types, non-null counts, and memory size of the DataFrame
* memory_usage() - get the memory usage of each column of the data frame
* describe() - get basic summary statistics about each numerical column
* unique() - get the unique values in a column
* nunique() - get the number of unique values in a column
* value_counts() - get the occurrence counts for each value in a column

# How To Not Read In *All* The Data

We can use the `nrows` and `usecols` arguments to read in only a certain number of rows and only certain columns. Save memory and only read in what you need! Use a smaller number of rows while prototyping an analysis report!

In [17]:
store_data_subset = pd.read_csv(filepath, nrows=100, usecols=['Store_ID', 'Manager'])
print(store_data_subset.shape)
store_data_subset.head()

(100, 2)


Unnamed: 0,Store_ID,Manager
0,4462,JillianA
1,4462,ZoeyD
2,4462,JessicaB
3,4462,JessicaB
4,4462,JessicaB
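
If you want to try `nrows` and `usecols` without the course files, the sketch below reads from a string instead of a file. `io.StringIO` makes the string behave like an open file; the CSV text is made up for illustration.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a file on disk.
csv_text = (
    "Store_ID,Manager,Sales\n"
    "4462,JillianA,420.0\n"
    "4462,ZoeyD,155.0\n"
    "4587,TommyA,675.0\n"
)

# Read only the first 2 data rows and only 2 of the 3 columns.
subset = pd.read_csv(
    io.StringIO(csv_text),
    nrows=2,
    usecols=['Store_ID', 'Manager'],
)

print(subset.shape)
print(subset)
```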


<a name='LoadingTabDataFile'></a>
# Reading A New Data File: Auto MPG Data With Tab Separated Fields

We will now load a new data file and practice what we have learned so far. We will also introduce the following new arguments to the `read_csv()` method:

* `sep` - this argument allows us to specify the field separator in our data file
* `index_col` - this argument allows us to specify which column of data is the index (if file contains an index)
* `na_values` - this argument allows us to specify a list of values that should be counted as null values


This is a dataset of car models made from 1970 to 1982. It includes the following attributes of each model: mpg, number of cylinders, engine displacement, horsepower, weight, acceleration (usually documented as the time in seconds to go from 0 to 60 mph, not m/s^2), model year, and car name.

### Introducing the `sep` argument in the `read_csv()` method.
The `sep` argument allows us to specify the field separator that pandas should use when attempting to read in the data. Below, we set it to the tab escape sequence which is '\t'. (This just means that '\t' indicates a tab). Note that the default value for the `sep` argument is ',' which is why we do not have to set it when reading in comma separated data.

In [18]:
filepath = os.path.join(os.getcwd(), 'data', 'auto-mpg-tabs.csv')
autoMPGData = pd.read_csv(filepath, sep='\t')
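
To see why `sep` matters, this sketch reads the same made-up tab separated text twice, once with the default separator and once with `sep='\t'`. The text and variable names are invented for illustration.

```python
import io

import pandas as pd

# Made-up tab separated content ('\t' is a tab character).
tsv_text = (
    "mpg\tcylinders\tcarname\n"
    "18.0\t8\tford torino\n"
    "15.0\t8\tbuick skylark 320\n"
)

# With the default sep=',' pandas finds no commas, so every line ends up
# in a single column.
wrong = pd.read_csv(io.StringIO(tsv_text))

# With sep='\t' the fields are split correctly.
right = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(wrong.shape)
print(right.shape)
```

A dataframe with only one column right after loading is often a sign that the separator is wrong.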

#### We will now use the .head() method to look at our data

In [19]:
# Exercise: Fill in the line below to use the head() method
autoMPGData.head()

Unnamed: 0.1,Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,accelartion,model year,carname
0,0,18.0,8,307.0,130.0,3504.0,12.0,70,chevrolet chevelle malibu
1,1,15.0,8,350.0,165.0,3693.0,11.5,70,buick skylark 320
2,2,18.0,8,318.0,150.0,3436.0,11.0,70,plymouth satellite
3,3,16.0,8,304.0,150.0,3433.0,12.0,70,amc rebel sst
4,4,17.0,8,302.0,140.0,3449.0,10.5,70,ford torino


### Introducing the `index_col` argument to the `read_csv()` method.

Notice the first column, 'Unnamed: 0'. We see it in the dataframe because this file already has an index column (see the screenshot below). Pandas always adds its own index, so it treats the index column in the file as a regular column of data. Since this column has no header in the file, pandas gives it the generic heading 'Unnamed: 0'. We can use the `index_col` argument when reading in a csv to indicate which column, already present in the data file, we would like to use as the index. In this case, we want to use the first column. Remember that Python is zero-indexed, so the first column is column 0.

In [20]:
# Note how we use the index_col argument to read in the first column, in the data file, as the index.
autoMPGData = pd.read_csv(filepath, sep='\t', index_col=0)
autoMPGData.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,accelartion,model year,carname
0,18.0,8,307.0,130.0,3504.0,12.0,70,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,ford torino
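
The same behavior can be reproduced on a tiny made-up file whose first column has no header, which is what a saved index looks like. The text and variable names are invented for illustration.

```python
import io

import pandas as pd

# Made-up CSV content: the header row starts with a comma, so the first
# column (the saved index) has no name.
csv_text = (
    ",mpg,cylinders\n"
    "0,18.0,8\n"
    "1,15.0,8\n"
)

# Without index_col, the saved index becomes a data column named 'Unnamed: 0'.
without_index_col = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column is used as the index instead.
with_index_col = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(without_index_col.columns))
print(list(with_index_col.columns))
```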


### Using `.shape`, `.info()` and `describe()` to better understand the data set.
Notice below how the horsepower data type is 'object' and not 'int64' or 'float64'. Horsepower is a number, so we would expect the datatype to be an int or float. But pandas has recognized it as 'object' (which means that pandas is treating the column as a column of strings). This is unexpected, and it means that there is probably a string somewhere in the data! We will see what it is using some of the other methods we have learned.

Exercise: fill in the cells below to use the `shape` attribute, and the `info` and `describe` methods.

In [21]:
autoMPGData.shape

(398, 8)

In [25]:
autoMPGData.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 398 entries, 0 to 397
Data columns (total 8 columns):
mpg             398 non-null float64
cylinders       398 non-null int64
displacement    398 non-null float64
horsepower      398 non-null object
weight          398 non-null float64
accelartion     398 non-null float64
model year      398 non-null int64
carname         398 non-null object
dtypes: float64(4), int64(2), object(2)
memory usage: 28.0+ KB


In [23]:
autoMPGData.describe()

Unnamed: 0,mpg,cylinders,displacement,weight,accelartion,model year
count,398.0,398.0,398.0,398.0,398.0,398.0
mean,23.514573,5.454774,193.425879,2970.424623,15.56809,76.01005
std,7.815984,1.701004,104.269838,846.841774,2.757689,3.697627
min,9.0,3.0,68.0,1613.0,8.0,70.0
25%,17.5,4.0,104.25,2223.75,13.825,73.0
50%,23.0,4.0,148.5,2803.5,15.5,76.0
75%,29.0,8.0,262.0,3608.0,17.175,79.0
max,46.6,8.0,455.0,5140.0,24.8,82.0


### Using the `na_values` argument to recognize custom null values when reading in data

We see below that null horsepower values are specified as '?' in the data.  There are a few ways to deal with this.

Let's now use the `unique()` method to spot the null value. This is not the only way to find the bad value. But this is one way, using a method we have learned so far. We will see some other possibilities in coming lectures.

In [29]:
autoMPGData["horsepower"].unique()

array(['130.0', '165.0', '150.0', '140.0', '198.0', '220.0', '215.0',
       '225.0', '190.0', '170.0', '160.0', '95.00', '97.00', '85.00',
       '88.00', '46.00', '87.00', '90.00', '113.0', '200.0', '210.0',
       '193.0', '?', '100.0', '105.0', '175.0', '153.0', '180.0', '110.0',
       '72.00', '86.00', '70.00', '76.00', '65.00', '69.00', '60.00',
       '80.00', '54.00', '208.0', '155.0', '112.0', '92.00', '145.0',
       '137.0', '158.0', '167.0', '94.00', '107.0', '230.0', '49.00',
       '75.00', '91.00', '122.0', '67.00', '83.00', '78.00', '52.00',
       '61.00', '93.00', '148.0', '129.0', '96.00', '71.00', '98.00',
       '115.0', '53.00', '81.00', '79.00', '120.0', '152.0', '102.0',
       '108.0', '68.00', '58.00', '149.0', '89.00', '63.00', '48.00',
       '66.00', '139.0', '103.0', '125.0', '133.0', '138.0', '135.0',
       '142.0', '77.00', '62.00', '132.0', '84.00', '64.00', '74.00',
       '116.0', '82.00'], dtype=object)

#### We can now specify a list of `na_values` in our dataset.

Let's use the `na_values` argument to specify a list of custom values for pandas to treat as null when reading in the data.  We will then use the `info()` method to confirm that these values are being read in as null.

In [30]:
autoMPGData = pd.read_csv(filepath, sep='\t', index_col=0, na_values=['?'])
autoMPGData.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 398 entries, 0 to 397
Data columns (total 8 columns):
mpg             398 non-null float64
cylinders       398 non-null int64
displacement    398 non-null float64
horsepower      392 non-null float64
weight          398 non-null float64
accelartion     398 non-null float64
model year      398 non-null int64
carname         398 non-null object
dtypes: float64(5), int64(2), object(1)
memory usage: 28.0+ KB
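
Here is the same fix on a few made-up rows, so you can see the data type change for yourself. The sketch also shows one alternative, `pd.to_numeric()` with `errors='coerce'`, which converts a column after loading and turns anything that is not a number into a null. The data and variable names are invented for illustration.

```python
import io

import pandas as pd

# Made-up CSV content with '?' marking a missing horsepower value.
csv_text = (
    "horsepower,carname\n"
    "130.0,chevrolet chevelle malibu\n"
    "?,ford pinto\n"
    "95.00,datsun pl510\n"
)

# Without na_values, the '?' forces the whole column to be read as text.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, '?' becomes a null and the column is read as a number.
clean = pd.read_csv(io.StringIO(csv_text), na_values=['?'])

# Alternative: convert after loading; non-numeric entries become null.
coerced = pd.to_numeric(raw['horsepower'], errors='coerce')

print(raw['horsepower'].dtype)
print(clean['horsepower'].dtype)
print(clean['horsepower'].isna().sum())
```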


<a name='LoadingWhiteSpaceFile'></a>
# Reading A New Data File: ISD Weather Data Delimited By White Space
We will now look at one more data file. This file is from the isd-lite data that can be found here: <ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-lite>

These files contain weather observations from weather stations all over the world.  We will look at the 2001 data for the station 724080-13739 which is a station at the Philadelphia International Airport.

This particular data is delimited by white space. White space can mean a number of things: tabs, spaces, new lines.  In this case it just means spaces; see the screen shot below.

We will learn:
* `delim_whitespace` - an argument to the `read_csv()` method that allows us to easily read in data delimited by white space
* `header` - an argument to specify if the file contains column names or not
* `names` - an argument to the `read_csv()` method that allows us to specify the column names when reading in the file 

#### Introducing the `delim_whitespace` argument
We can use a special argument when the fields in a data file are separated by a varying amount of white space. That is, fields could be separated by different numbers of spaces, or by a mix of tabs and spaces, etc.

In [31]:
filepath = os.path.join(os.getcwd(), 'data', 'Philadelphia_Pennsylvania_USA', '724080-13739-2001')
weatherData = pd.read_csv(filepath, delim_whitespace=True)

In [32]:
weatherData.head()

Unnamed: 0,2001,01,01.1,00,-6,-94,10146,280,57,2,0,-9999
0,2001,1,1,1,-11,-94,10153,280,57,4,0,-9999
1,2001,1,1,2,-17,-106,10161,290,62,2,0,-9999
2,2001,1,1,3,-28,-100,10169,260,57,0,0,-9999
3,2001,1,1,4,-28,-100,10177,260,52,0,0,-9999
4,2001,1,1,5,-44,-100,10182,250,52,0,0,-9999


#### Using the `header` argument
Notice in the output above that pandas treated the first row of data as the column names, because this file has no header row. We can use the `header` argument to prevent pandas from doing this. See below how we set the header argument to None. By default, pandas infers the header, which in practice means it reads the first row of the file (row 0) as the column names. Remember that Python is zero-indexed, so a value of 0 indicates the first row. By setting header to None we are "telling" .read_csv() that it should not treat any row as the header when reading the file, and it will just number the columns 0 through 11.

In [33]:
weatherData = pd.read_csv(filepath, delim_whitespace=True, header=None)
weatherData.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,2001,1,1,0,-6,-94,10146,280,57,2,0,-9999
1,2001,1,1,1,-11,-94,10153,280,57,4,0,-9999
2,2001,1,1,2,-17,-106,10161,290,62,2,0,-9999
3,2001,1,1,3,-28,-100,10169,260,57,0,0,-9999
4,2001,1,1,4,-28,-100,10177,260,52,0,0,-9999


#### How to set column names by using the `names` argument when we read in the data.
If we know what the column names should be, we can pass them to the names argument as a list, and pandas will automatically apply the names to the columns when it reads in the data, and it will treat the first row in the file as data.

We know what the column names should be, by looking at the data documentation which is here: <ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-lite/isd-lite-format.pdf>

In [34]:
column_names = ['Year', 'Month', 'Day', 'Hour', 'Air Temp', 'Dew Point Temp',
           'Sea Level Pressure',
           'Wind Direction', 'Wind Speed Rate',
           'Sky Condition Total Coverage Code',
           'Liquid Precipitation Depth Dimension - 1Hr Duration',
           'Liquid Precipitation Depth Dimension - Six Hour Duration']
print(column_names)
weatherData = pd.read_csv(filepath, delim_whitespace=True, names=column_names)

['Year', 'Month', 'Day', 'Hour', 'Air Temp', 'Dew Point Temp', 'Sea Level Pressure', 'Wind Direction', 'Wind Speed Rate', 'Sky Condition Total Coverage Code', 'Liquid Precipitation Depth Dimension - 1Hr Duration', 'Liquid Precipitation Depth Dimension - Six Hour Duration']


In [35]:
weatherData.head(2)

Unnamed: 0,Year,Month,Day,Hour,Air Temp,Dew Point Temp,Sea Level Pressure,Wind Direction,Wind Speed Rate,Sky Condition Total Coverage Code,Liquid Precipitation Depth Dimension - 1Hr Duration,Liquid Precipitation Depth Dimension - Six Hour Duration
0,2001,1,1,0,-6,-94,10146,280,57,2,0,-9999
1,2001,1,1,1,-11,-94,10153,280,57,4,0,-9999
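
The three arguments from this section can be tried together on a couple of made-up lines in the same layout as the weather file. One caveat: pandas 2.2 and later mark `delim_whitespace` as deprecated in favor of `sep=r'\s+'`, so this sketch uses `sep=r'\s+'`, which should behave the same way. The shortened `short_names` list is invented for illustration.

```python
import io

import pandas as pd

# Made-up lines with a varying number of spaces between fields and no header.
text = (
    "2001 01 01 00   -6  -94\n"
    "2001 01 01 01  -11  -94\n"
)

# sep=r'\s+' splits on any run of white space. header=None tells pandas
# that no row holds column names, so the columns are numbered 0, 1, 2, ...
no_header = pd.read_csv(io.StringIO(text), sep=r'\s+', header=None)

# Passing names supplies the column names and keeps the first row as data.
short_names = ['Year', 'Month', 'Day', 'Hour', 'Air Temp', 'Dew Point Temp']
named = pd.read_csv(io.StringIO(text), sep=r'\s+', names=short_names)

print(no_header)
print(named)
```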


#### What are the -9999 values?
You have probably noticed the -9999 values in the 'Liquid Precipitation Depth Dimension - Six Hour Duration' column.  Without knowing anything more, we should be very suspicious that this is a special value indicating a missing value.  If we look at the data documentation linked in a previous cell, we will see that -9999 is used as a missing value.  We will come back to missing values in a future lecture, and we will specifically look at this example.  For now, we note it and move on.

#### In Class Exercise:

In the cell below, use the `na_values` argument we learned about earlier to read in the weather data file and treat the -9999 values as null values.

In [41]:
weatherData = pd.read_csv(filepath, delim_whitespace=True, names=column_names, na_values=[-9999])

#### Using .shape, .dtypes, .info(), and .describe() to take a closer look at the weather data.

Exercise: In the cells below, use these methods to get a closer look at the dataframe.

In [42]:
weatherData.head()

Unnamed: 0,Year,Month,Day,Hour,Air Temp,Dew Point Temp,Sea Level Pressure,Wind Direction,Wind Speed Rate,Sky Condition Total Coverage Code,Liquid Precipitation Depth Dimension - 1Hr Duration,Liquid Precipitation Depth Dimension - Six Hour Duration
0,2001,1,1,0,-6.0,-94.0,10146.0,280.0,57.0,2.0,0,
1,2001,1,1,1,-11.0,-94.0,10153.0,280.0,57.0,4.0,0,
2,2001,1,1,2,-17.0,-106.0,10161.0,290.0,62.0,2.0,0,
3,2001,1,1,3,-28.0,-100.0,10169.0,260.0,57.0,0.0,0,
4,2001,1,1,4,-28.0,-100.0,10177.0,260.0,52.0,0.0,0,


In [37]:
weatherData.shape

(8758, 12)

In [38]:
weatherData.dtypes

Year                                                          int64
Month                                                         int64
Day                                                           int64
Hour                                                          int64
Air Temp                                                    float64
Dew Point Temp                                              float64
Sea Level Pressure                                          float64
Wind Direction                                              float64
Wind Speed Rate                                             float64
Sky Condition Total Coverage Code                           float64
Liquid Precipitation Depth Dimension - 1Hr Duration           int64
Liquid Precipitation Depth Dimension - Six Hour Duration    float64
dtype: object

In [39]:
weatherData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8758 entries, 0 to 8757
Data columns (total 12 columns):
Year                                                        8758 non-null int64
Month                                                       8758 non-null int64
Day                                                         8758 non-null int64
Hour                                                        8758 non-null int64
Air Temp                                                    8757 non-null float64
Dew Point Temp                                              8754 non-null float64
Sea Level Pressure                                          8757 non-null float64
Wind Direction                                              8549 non-null float64
Wind Speed Rate                                             8755 non-null float64
Sky Condition Total Coverage Code                           8667 non-null float64
Liquid Precipitation Depth Dimension - 1Hr Duration         8758 non-null int64
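Liquid Precipitation Depth Dimension - Six Hour Duration    293 non-null float64
dtypes: float64(7), int64(5)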

In [40]:
weatherData.describe()

Unnamed: 0,Year,Month,Day,Hour,Air Temp,Dew Point Temp,Sea Level Pressure,Wind Direction,Wind Speed Rate,Sky Condition Total Coverage Code,Liquid Precipitation Depth Dimension - 1Hr Duration,Liquid Precipitation Depth Dimension - Six Hour Duration
count,8758.0,8758.0,8758.0,8758.0,8757.0,8754.0,8757.0,8549.0,8755.0,8667.0,8758.0,293.0
mean,2001.0,6.525006,15.720256,11.499087,139.056755,67.724126,10180.253511,202.850626,40.314106,4.492212,0.84517,29.638225
std,0.0,3.447779,8.796434,6.923071,96.444091,97.294557,71.305713,107.267956,22.614859,3.161479,6.527723,52.859586
min,2001.0,1.0,1.0,0.0,-84.0,-189.0,9897.0,0.0,0.0,0.0,-1.0,-1.0
25%,2001.0,4.0,8.0,5.25,61.0,-11.0,10132.0,120.0,26.0,2.0,0.0,7.0
50%,2001.0,7.0,16.0,11.0,139.0,72.0,10178.0,220.0,36.0,4.0,0.0,7.0
75%,2001.0,10.0,23.0,17.75,222.0,150.0,10226.0,290.0,52.0,8.0,0.0,30.0
max,2001.0,12.0,31.0,23.0,378.0,261.0,10408.0,360.0,170.0,9.0,262.0,531.0


### Investigating "The Sky Condition Total Coverage Code" using `value_counts()`

Let's see if we can easily calculate the percentage distribution of the different "Sky Condition Total Coverage" codes.

These are the different codes:

* 0 - No Clouds
* 2 - 2 Oktas
* 4 - 4 Oktas
* 6 - 6 Oktas
* 7 - 7 Oktas
* 8 - 8 Oktas
* 9 - Sky obscured or cloud amount can not be estimated
* -9999 - Missing

In [44]:
weatherData.shape

(8758, 12)

In [43]:
print(weatherData['Sky Condition Total Coverage Code'].value_counts())

print('\n\n')
print('Now, we divide by the total number of rows, and multiply by 100, to get percentage values:')
# Here we divide by the total number of rows and multiply by 100 to get the % of each
# cloud cover type in the data.
(weatherData['Sky Condition Total Coverage Code'].value_counts() / weatherData.shape[0]) * 100

8.0    2455
0.0    1920
4.0    1410
7.0    1341
2.0    1262
6.0     256
9.0      23
Name: Sky Condition Total Coverage Code, dtype: int64



Now, we divide by the total number of rows, and multiply by 100, to get percentage values:


8.0    28.031514
0.0    21.922813
4.0    16.099566
7.0    15.311715
2.0    14.409683
6.0     2.923042
9.0     0.262617
Name: Sky Condition Total Coverage Code, dtype: float64
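
The percentages above add up to a little less than 100 because we divided by all 8758 rows, while 91 of them (8758 - 8667) have no sky condition code. The sketch below, on a made-up series, compares that approach with the `normalize` and `dropna` arguments of `value_counts()`.

```python
import pandas as pd

# Made-up codes: 8 rows, 2 of them missing.
codes = pd.Series(
    [8.0, 0.0, 8.0, 4.0, None, 0.0, 8.0, None],
    name='Sky Condition Total Coverage Code',
)

# Same approach as above: counts divided by ALL rows, including the rows
# with no code, so the percentages sum to less than 100.
pct_of_all_rows = codes.value_counts() / codes.shape[0] * 100

# normalize=True divides by the number of non-null values instead.
pct_of_non_null = codes.value_counts(normalize=True) * 100

# dropna=False keeps the missing values as their own category.
pct_with_missing = codes.value_counts(normalize=True, dropna=False) * 100

print(pct_of_all_rows)
print(pct_of_non_null)
print(pct_with_missing)
```

Which denominator is right depends on the question: "what share of all hours" and "what share of hours with a recorded code" are different questions.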

## Note that most of the methods and attributes we have used have returned their output as dataframes or series 


You may have noticed that methods like `value_counts()` and `describe()` (and most of the others) have returned output in the form of dataframes or series. This can be useful, as you can apply all the same methods and arguments to those results as well.
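
As a small sketch of this idea, the code below calls `head()` on the result of `value_counts()` and selects a single number from the result of `describe()`. The `toy` dataframe is made up for illustration.

```python
import pandas as pd

# A made-up dataframe.
toy = pd.DataFrame({
    'Manager': ['A', 'B', 'A', 'C', 'A', 'B'],
    'Sales': [100.0, 200.0, 300.0, 400.0, 500.0, 600.0],
})

# value_counts() returns a series, so we can call head() on the result
# to keep only the two most frequent managers.
top_two = toy['Manager'].value_counts().head(2)

# describe() returns a dataframe, so we can select one value from it
# by row label and column name.
summary = toy.describe()
mean_sales = summary.loc['mean', 'Sales']

print(type(top_two))
print(type(summary))
print(top_two)
print(mean_sales)
```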

<a name='WritingOutToCSV'></a>
# How To Write Data Back Out To A CSV

To write data out to a csv, we can use the `to_csv()` method. The docs are here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html

Below, let's use `to_csv()` to write out our weatherData dataframe.

Note that pandas will write out the index as well. This means that when we read the file back in, we need to use the `index_col` argument of `read_csv()`. Alternatively, we can use the `index` argument of `to_csv()` to prevent the index from being written out in the first place.

In [45]:
# First, create the path to the file we would like to create
save_path = os.path.join(os.getcwd(), 'data', 'Philadelphia_Pennsylvania_USA', '724080-13739-2001_out')

# Then use to_csv
weatherData.to_csv(save_path)

#### Let's now read the file back in

Note that we will need to use the `index_col` argument to read it in correctly. As an exercise, we will add the argument to the call below together in class.

In [46]:
weatherData2 = pd.read_csv(save_path)
weatherData2.head()

Unnamed: 0.1,Unnamed: 0,Year,Month,Day,Hour,Air Temp,Dew Point Temp,Sea Level Pressure,Wind Direction,Wind Speed Rate,Sky Condition Total Coverage Code,Liquid Precipitation Depth Dimension - 1Hr Duration,Liquid Precipitation Depth Dimension - Six Hour Duration
0,0,2001,1,1,0,-6.0,-94.0,10146.0,280.0,57.0,2.0,0,
1,1,2001,1,1,1,-11.0,-94.0,10153.0,280.0,57.0,4.0,0,
2,2,2001,1,1,2,-17.0,-106.0,10161.0,290.0,62.0,2.0,0,
3,3,2001,1,1,3,-28.0,-100.0,10169.0,260.0,57.0,0.0,0,
4,4,2001,1,1,4,-28.0,-100.0,10177.0,260.0,52.0,0.0,0,


#### Let's write out the data again, but this time use the `index` argument to not write out the index in the first place

In [None]:
weatherData.to_csv(save_path, index=False)
weatherData2 = pd.read_csv(save_path)
weatherData2.head()
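
Both ways of handling the index can be compared in one self-contained sketch. It writes a made-up dataframe to a temporary folder (so nothing is left behind in your data directory) and reads it back three ways. The file names and variable names are invented for illustration.

```python
import os
import tempfile

import pandas as pd

# A made-up dataframe standing in for weatherData.
toy = pd.DataFrame({'Year': [2001, 2001], 'Air Temp': [-6.0, -11.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index_path = os.path.join(tmp_dir, 'with_index.csv')
    no_index_path = os.path.join(tmp_dir, 'no_index.csv')

    toy.to_csv(with_index_path)              # the index is written as the first column
    toy.to_csv(no_index_path, index=False)   # the index is not written

    # Reading the first file naively gives an extra 'Unnamed: 0' column.
    extra_column = pd.read_csv(with_index_path)

    # Fix 1: tell read_csv() that the first column is the index.
    fixed_on_read = pd.read_csv(with_index_path, index_col=0)

    # Fix 2: the index was never written, so nothing special is needed.
    fixed_on_write = pd.read_csv(no_index_path)

print(list(extra_column.columns))
print(list(fixed_on_read.columns))
print(list(fixed_on_write.columns))
```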

<a name='LessonSummary'></a>
# Lesson Summary:
In this lesson you learned about the following:
* Methods and attributes that help describe data files:
    * head(n) - get the first n rows
    * tail(n) - get the last n rows
    * sample(n) - get a random sample of n rows
    * shape - get the number of rows and columns
    * columns - get the column names
    * dtypes - get the variable types of each column
    * info() - get the variable types, non-null counts, and memory size of the dataframe
    * memory_usage() - get the memory usage of each column of the data frame
    * describe() - get basic summary statistics about each numerical column
    * unique() - get the unique values in a column
    * nunique() - get the number of unique values in a column
    * value_counts() - get the occurrence counts for each value in a column
<br>
<br>
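* Arguments to the read_csv() method that help you read in various file types:
    * sep - an argument that allows you to specify the field separator used (we saw commas and tabs)
    * index_col - an argument to specify the column used as the index
    * na_values - an argument to specify values that should be treated as null
    * delim_whitespace - an argument to read in data delimited by white space
    * names - an argument to specify column names
    * header - an argument to specify which row to use as the header
    * nrows and usecols - arguments to read in only some of the rows and columns
    * columns - not a read_csv() argument, but a dataframe attribute that can be set to change the column names of a dataframe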
<br>
<br>
* How to use to_csv() to write out data and how to use the index argument.


## In Class Exercise
In the cells below, load the file "AAA_Fuel_Prices.csv" and use some of the methods we learned above to explore it.

In [47]:
aaa_filepath = os.path.join(os.getcwd(), 'data', 'AAA_Fuel_Prices.csv')
aaa_data = pd.read_csv(aaa_filepath)

In [48]:
aaa_data.head()

Unnamed: 0,Month_of_Price,County,Fuel,Price,PhysicalUnit
0,01/01/2006 12:00:00 AM,US,Gasoline - Regular,2.314,Dollars
1,01/01/2006 12:00:00 AM,US,Gasoline - Midgrade,2.457,Dollars
2,01/01/2006 12:00:00 AM,US,Gasoline - Premium,2.546,Dollars
3,01/01/2006 12:00:00 AM,US,Diesel,2.568,Dollars
4,01/01/2006 12:00:00 AM,State of Hawaii,Gasoline - Regular,2.8,Dollars


## Questions or Comments About This Notebook?
Feel free to contact me via my LinkedIn: https://www.linkedin.com/in/william-j-henry <br>
You can also email me at will@henryanalytics.com <br>