# Creating Columns, Dropping Columns, and Setting Values In A DataFrame
## Notebook Outline:

* <a href='#AddNewColumns'>Adding New Columns To A DataFrame</a>
* <a href='#DroppingColumns'>Dropping Columns From DataFrames</a>
* <a href='#UpdatingColumnValues'>Updating Column Values</a>
* <a href='#SettingSpecificValues'>Setting Specific Values</a>

# How to use this Notebook

The best way to use this notebook is to follow along with the lecture and then apply what you learn to your own data files or, if you do not have any data of your own, to practice using these functions and methods on the provided data. A little practice goes a long way towards understanding and retaining the material! It would be easy to just skim this notebook, but you will learn more by doing!

<a name='AddNewColumns'></a>
# Adding New Columns to a DataFrame
Adding new columns to a DataFrame is straightforward. We just need to use bracket notation (without `.loc[]` or `.iloc[]`).
We need a dataset to practice on, so let's load the labor sheet data.

In [1]:
# In this cell we import pandas and load the datafile.
import pandas as pd
import os

filepath = os.path.join(os.getcwd(), 'data', 'ShiftManagerApp_LaborSheet.csv')
labor_sheet_data = pd.read_csv(filepath, parse_dates=[['Date', 'Ending_Hour'], 'Timestamp'])

In [2]:
labor_sheet_data.head()

Unnamed: 0,Date_Ending_Hour,Store_ID,Manager,Projected_Sales,Sales,DT_TTL,Car_Count,KVS_Total,Scheduled_People,Actual_People,Reason_for_Labor_Diff,Reason_for_High_TTLs,Manager_Entering_Data,Timestamp,OEPE,Park_Percentage
0,2017-01-23 08:00:00,4462,JillianA,540.0,420.0,170.0,,100.0,,,,,,2017-01-23 09:52:14,,
1,2017-02-05 06:00:00,4462,ZoeyD,90.0,155.0,114.0,,78.0,,,,,,2017-02-05 11:30:48,,
2,2017-02-05 07:00:00,4462,JessicaB,173.0,182.0,106.0,,81.0,,,,,,2017-02-05 11:35:48,,
3,2017-02-05 08:00:00,4462,JessicaB,333.0,311.0,102.0,,55.0,,,,,,2017-02-05 11:52:05,,
4,2017-02-05 09:00:00,4462,JessicaB,594.0,598.0,155.0,,106.0,,,,,,2017-02-05 11:59:35,,


## Creating A Sales +/- Column
It would be useful to have a column that indicates how much `Sales` differed from `Projected_Sales`. Let's calculate this:


In [3]:
labor_sheet_data["Sales +/-"] = labor_sheet_data['Sales'] - labor_sheet_data['Projected_Sales']
labor_sheet_data.head()

Unnamed: 0,Date_Ending_Hour,Store_ID,Manager,Projected_Sales,Sales,DT_TTL,Car_Count,KVS_Total,Scheduled_People,Actual_People,Reason_for_Labor_Diff,Reason_for_High_TTLs,Manager_Entering_Data,Timestamp,OEPE,Park_Percentage,Sales +/-
0,2017-01-23 08:00:00,4462,JillianA,540.0,420.0,170.0,,100.0,,,,,,2017-01-23 09:52:14,,,-120.0
1,2017-02-05 06:00:00,4462,ZoeyD,90.0,155.0,114.0,,78.0,,,,,,2017-02-05 11:30:48,,,65.0
2,2017-02-05 07:00:00,4462,JessicaB,173.0,182.0,106.0,,81.0,,,,,,2017-02-05 11:35:48,,,9.0
3,2017-02-05 08:00:00,4462,JessicaB,333.0,311.0,102.0,,55.0,,,,,,2017-02-05 11:52:05,,,-22.0
4,2017-02-05 09:00:00,4462,JessicaB,594.0,598.0,155.0,,106.0,,,,,,2017-02-05 11:59:35,,,4.0
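To see the bracket-notation pattern in isolation, here is a minimal sketch on a tiny stand-in DataFrame. The name `toy` is hypothetical, and the values are copied from the first three rows shown above. It also shows that assigning a single value fills every row.

```python
import pandas as pd

# A tiny stand-in DataFrame; values copied from the first three rows above.
toy = pd.DataFrame({
    "Projected_Sales": [540.0, 90.0, 173.0],
    "Sales": [420.0, 155.0, 182.0],
})

# Assigning to a column label that does not exist yet creates the column.
toy["Sales +/-"] = toy["Sales"] - toy["Projected_Sales"]

# Assigning a single (scalar) value broadcasts it to every row.
toy["Store_ID"] = 4462

print(toy)
```

New columns are appended on the right, which is why `Sales +/-` shows up as the last column in the output above.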


## In Class Exercise: Exploring The Sales +/- Column

Using what we have learned in the course so far, let's look for outliers in the 'Sales +/-' column. Focus on the max and min values: on which dates did they occur, who were the managers, and do the values look reasonable or unreasonable compared with the 'Sales +/-' values from the nearby hours?

In [4]:
labor_sheet_data['Sales +/-'].describe()

count     23566.000000
mean         13.273909
std        2106.412821
min       -4319.000000
25%         -56.000000
50%          -2.405000
75%          50.000000
max      320408.000000
Name: Sales +/-, dtype: float64
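One possible way to chase down the max and min rows is with `idxmax()` and `idxmin()`, which return the index labels of the extreme values. The sketch below uses a made-up four-row DataFrame (the name `toy` and all of its values are assumptions for illustration) and assumes the default integer index, so that "nearby hours" can be selected by label arithmetic.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the labor sheet.
toy = pd.DataFrame({
    "Date_Ending_Hour": pd.to_datetime([
        "2017-02-05 06:00", "2017-02-05 07:00",
        "2017-02-05 08:00", "2017-02-05 09:00",
    ]),
    "Manager": ["A", "B", "B", "C"],
    "Sales +/-": [65.0, 9.0, 3200.0, -22.0],
})

# Index labels of the largest and smallest values in the column.
max_label = toy["Sales +/-"].idxmax()
min_label = toy["Sales +/-"].idxmin()

# Look up the date and manager for those two rows.
print(toy.loc[[max_label, min_label], ["Date_Ending_Hour", "Manager", "Sales +/-"]])

# With the default integer index, the rows just before and after the max
# give context for judging whether the value looks reasonable.
# Note that .loc slicing includes both endpoints.
neighbors = toy.loc[max_label - 1 : max_label + 1]
print(neighbors)
```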

## In Class Exercise: Add A "Data Entry Timedelta" Column
The 'Timestamp' column indicates when the data was entered. The 'Date_Ending_Hour' column indicates the date and hour at which the reported hour *ends*, i.e. the row for "2018-11-01 08:00:00" should contain the data for the hour that ends at "2018-11-01 08:00:00" (so 7am to 8am on that date). Ideally, the data is entered within 10 minutes of the end of the hour. Let's calculate this delta.

In the cell below, calculate a new column called `Data Entry Timeliness`, which will be the `Timestamp` column minus the `Date_Ending_Hour` column.

In [6]:
labor_sheet_data["Data Entry Timeliness"] = labor_sheet_data['Timestamp'] - labor_sheet_data["Date_Ending_Hour"]

In [7]:
labor_sheet_data.head(1)

Unnamed: 0,Date_Ending_Hour,Store_ID,Manager,Projected_Sales,Sales,DT_TTL,Car_Count,KVS_Total,Scheduled_People,Actual_People,Reason_for_Labor_Diff,Reason_for_High_TTLs,Manager_Entering_Data,Timestamp,OEPE,Park_Percentage,Sales +/-,Data Entry Timeliness
0,2017-01-23 08:00:00,4462,JillianA,540.0,420.0,170.0,,100.0,,,,,,2017-01-23 09:52:14,,,-120.0,01:52:14


### Introducing `dt.total_seconds()`
We saw in a previous notebook that we need to use the `.dt` accessor to reach datetime methods. One of those methods is `total_seconds()`, which returns the total length of a timedelta in seconds. Let's use it below.

This is an important method to know when you need the total amount of time in seconds. In contrast, the `.dt.days` attribute does not give you the total amount of time in days; it only gives the whole-days component of the timedelta.

In [17]:
labor_sheet_data['Data Entry Late Minutes'] = \
    labor_sheet_data["Data Entry Timeliness"].dt.total_seconds() / 60
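The difference between `total_seconds()` and `.dt.days` is easiest to see on a few hand-picked timedeltas. This is a small sketch; the series `deltas` and its values are made up for illustration (the first one matches the `01:52:14` shown earlier).

```python
import pandas as pd

# Three made-up timedeltas: under a day, over a day, and negative.
deltas = pd.Series(pd.to_timedelta(["01:52:14", "1 days 12:00:00", "-00:30:00"]))

# Total length of each timedelta, expressed in minutes.
total_minutes = deltas.dt.total_seconds() / 60

# Only the whole-days component of each timedelta.
days_component = deltas.dt.days

print(pd.DataFrame({
    "delta": deltas,
    "total_minutes": total_minutes,
    "days_component": days_component,
}))
```

The 36-hour timedelta has a days component of 1 even though it lasts 1.5 days, and the negative half hour has a days component of -1 because pandas stores it as "-1 days +23:30:00". Dividing `total_seconds()` by 60, 3600, or 86400 avoids both surprises.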

## In Class Exercise: Answer The Following Questions:

* How many entries have a *negative* 'Data Entry Late Minutes'?
* How many entries have a 'Data Entry Late Minutes' that is within 10 minutes (but not negative). What percentage of entries is this?
* How many entries have a 'Data Entry Late Minutes' that is greater than 4 hours?
* What is the maximum 'Data Entry Late Minutes'?

In [19]:
problem_1 = (labor_sheet_data['Data Entry Late Minutes'] < 0).sum()

print(problem_1)

((labor_sheet_data['Data Entry Late Minutes'] <= 10) & (labor_sheet_data['Data Entry Late Minutes'] > 0)).mean()

5319


0.1513485925169801
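The cell above covers the first two questions. One way to approach all four with boolean masks is sketched below on a made-up series (`late` and its values are assumptions for illustration). Note that this sketch uses `>= 0` for "not negative", so an entry at exactly 0 minutes counts as on time, whereas the cell above uses `> 0`.

```python
import pandas as pd

# Made-up lateness values in minutes, for illustration only.
late = pd.Series([-5.0, 3.0, 10.0, 112.2, 330.8, 275.8],
                 name="Data Entry Late Minutes")

# Summing a boolean mask counts the True values.
negative_count = (late < 0).sum()

# Combine conditions with & and wrap each one in parentheses.
on_time = (late >= 0) & (late <= 10)
on_time_count = on_time.sum()

# The mean of a boolean mask is the fraction of True values.
on_time_pct = on_time.mean() * 100

# 4 hours expressed in minutes.
over_4h_count = (late > 4 * 60).sum()

max_late = late.max()

print(negative_count, on_time_count, round(on_time_pct, 1), over_4h_count, max_late)
```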

<a name=DroppingColumns></a>
# Dropping Columns From a DataFrame
We can use the `.drop` method to drop columns from the DataFrame. We just need to specify that we are dropping a label from axis 1 (the columns). The `inplace` argument lets us update the DataFrame itself, rather than just outputting a new DataFrame with the column dropped.

The "Park_Percentage" column contains a lot of missing data. Let's drop it.

In [20]:
labor_sheet_data.drop('Park_Percentage', axis=1, inplace=True)

In [21]:
labor_sheet_data.head()

Unnamed: 0,Date_Ending_Hour,Store_ID,Manager,Projected_Sales,Sales,DT_TTL,Car_Count,KVS_Total,Scheduled_People,Actual_People,Reason_for_Labor_Diff,Reason_for_High_TTLs,Manager_Entering_Data,Timestamp,OEPE,Sales +/-,Data Entry Timeliness,Data Entry Late Minutes
0,2017-01-23 08:00:00,4462,JillianA,540.0,420.0,170.0,,100.0,,,,,,2017-01-23 09:52:14,,-120.0,01:52:14,112.233333
1,2017-02-05 06:00:00,4462,ZoeyD,90.0,155.0,114.0,,78.0,,,,,,2017-02-05 11:30:48,,65.0,05:30:48,330.8
2,2017-02-05 07:00:00,4462,JessicaB,173.0,182.0,106.0,,81.0,,,,,,2017-02-05 11:35:48,,9.0,04:35:48,275.8
3,2017-02-05 08:00:00,4462,JessicaB,333.0,311.0,102.0,,55.0,,,,,,2017-02-05 11:52:05,,-22.0,03:52:05,232.083333
4,2017-02-05 09:00:00,4462,JessicaB,594.0,598.0,155.0,,106.0,,,,,,2017-02-05 11:59:35,,4.0,02:59:35,179.583333
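As an alternative to `axis=1` with `inplace=True`, `.drop` also accepts a `columns=` keyword and returns a new DataFrame that you can assign back. The sketch below uses a made-up DataFrame (`toy`, `trimmed`, and the label `Not_A_Column` are hypothetical) to show both behaviors, plus `errors="ignore"` for labels that may not exist.

```python
import pandas as pd

# A made-up DataFrame for illustration.
toy = pd.DataFrame({
    "Sales": [420.0, 155.0],
    "OEPE": [None, None],
    "Park_Percentage": [None, None],
})

# Without inplace=True, drop returns a new DataFrame and leaves toy unchanged.
trimmed = toy.drop(columns=["Park_Percentage"])
print(list(toy.columns))      # still has all three columns
print(list(trimmed.columns))

# Several columns can be dropped at once; assigning back updates toy.
toy = toy.drop(columns=["OEPE", "Park_Percentage"])

# errors="ignore" skips labels that are not present instead of raising.
toy = toy.drop(columns=["Not_A_Column"], errors="ignore")
print(list(toy.columns))
```

Assigning the result back makes it explicit which variable holds the trimmed data, which can help when a cell is re-run: a second `inplace` drop of the same column would raise a `KeyError`.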


<a name=UpdatingColumnValues></a>
# Updating Values of a DataFrame Column

Let's now learn how to set values in a DataFrame column. We will use a different dataset for this. Let's load the Philadelphia Airport Weather Dataset.

## Updating the Air Temp Column in Our Weather Data

The data in the "Air Temp" column needs to be divided by 10 to put it in proper decimal notation. That is, a raw value of -6 is really -0.6 degrees Celsius. This is described in the data documentation, which can be found here: <ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-lite/isd-lite-format.pdf>

So, let's go ahead and update that column by dividing all the values by 10! We can do this with the same mathematical operators that we learned in the lecture on mathematical operators. 

However, first we must read in the data.

In [22]:
# In this cell we import pandas and load the datafile.
import pandas as pd
import os

filepath = os.path.join(os.getcwd(), 'data', 'Philadelphia_Pennsylvania_USA/724080-13739-2001')

headers = ['Year', 'Month', 'Day', 'Hour', 'Air Temp', 'Dew Point Temp', 'Sea Level Pressure',
           'Wind Direction', 'Wind Speed Rate',
           'Sky Condition Total Coverage Code',
           'Liquid Precipitation Depth Dimension - 1Hr Duration',
           'Liquid Precipitation Depth Dimension - Six Hour Duration']
# The file is whitespace-delimited; sep=r'\s+' splits on runs of whitespace.
weatherData = pd.read_csv(filepath, sep=r'\s+',
                          names=headers)
weatherData.head()

Unnamed: 0,Year,Month,Day,Hour,Air Temp,Dew Point Temp,Sea Level Pressure,Wind Direction,Wind Speed Rate,Sky Condition Total Coverage Code,Liquid Precipitation Depth Dimension - 1Hr Duration,Liquid Precipitation Depth Dimension - Six Hour Duration
0,2001,1,1,0,-6,-94,10146,280,57,2,0,-9999
1,2001,1,1,1,-11,-94,10153,280,57,4,0,-9999
2,2001,1,1,2,-17,-106,10161,290,62,2,0,-9999
3,2001,1,1,3,-28,-100,10169,260,57,0,0,-9999
4,2001,1,1,4,-28,-100,10177,260,52,0,0,-9999


### Updating the data in the "Air Temp" column by dividing by 10

In [23]:
weatherData.loc[:, 'Air Temp'] = weatherData['Air Temp']/10
weatherData.head()

Unnamed: 0,Year,Month,Day,Hour,Air Temp,Dew Point Temp,Sea Level Pressure,Wind Direction,Wind Speed Rate,Sky Condition Total Coverage Code,Liquid Precipitation Depth Dimension - 1Hr Duration,Liquid Precipitation Depth Dimension - Six Hour Duration
0,2001,1,1,0,-0.6,-94,10146,280,57,2,0,-9999
1,2001,1,1,1,-1.1,-94,10153,280,57,4,0,-9999
2,2001,1,1,2,-1.7,-106,10161,290,62,2,0,-9999
3,2001,1,1,3,-2.8,-100,10169,260,57,0,0,-9999
4,2001,1,1,4,-2.8,-100,10177,260,52,0,0,-9999
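The same update can also be written with plain bracket assignment, which replaces the whole column. Here is a minimal sketch on a made-up DataFrame (`toy` is hypothetical; the values are copied from the first three rows above) that also shows the column's dtype changing from integer to float.

```python
import pandas as pd

# Raw values copied from the first three rows above.
toy = pd.DataFrame({
    "Air Temp": [-6, -11, -17],
    "Dew Point Temp": [-94, -94, -106],
})
print(toy["Air Temp"].dtype)  # integer before the update

# Reassigning to an existing column label overwrites that column.
toy["Air Temp"] = toy["Air Temp"] / 10

print(toy["Air Temp"].dtype)  # float after the division
print(toy)
```

Be careful when re-running a cell like this: each run divides by 10 again, so the update should be applied exactly once after loading the data.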


### Converting the Air Temp from Celsius to Fahrenheit - in a _new_ column
Now we will convert the Celsius air temp values to Fahrenheit values in a new column of data.

Exercise: Create an `Air Temp F` column below by converting the values in the `Air Temp` column to Fahrenheit. The formula is `F = C * 1.8 + 32`.

In [24]:
weatherData['Air Temp F'] = weatherData['Air Temp'] * 1.8 + 32

In [25]:
weatherData.head(1)

Unnamed: 0,Year,Month,Day,Hour,Air Temp,Dew Point Temp,Sea Level Pressure,Wind Direction,Wind Speed Rate,Sky Condition Total Coverage Code,Liquid Precipitation Depth Dimension - 1Hr Duration,Liquid Precipitation Depth Dimension - Six Hour Duration,Air Temp F
0,2001,1,1,0,-0.6,-94,10146,280,57,2,0,-9999,30.92


<a name=SettingSpecificValues></a>
# Setting Specific Values in a DataFrame

### Changing the -9999 values to a Null value that Pandas will recognize.
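We can use the `.loc` notation to set specific values in a column: the first argument selects the rows (here with a boolean condition) and the second selects the column. For example, let's change the -9999 values in the 'Liquid Precipitation Depth Dimension - Six Hour Duration' column to missing values. We assign `None`, which pandas stores as `NaN` in a numeric column.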

In [26]:
weatherData.loc[weatherData['Liquid Precipitation Depth Dimension - Six Hour Duration'] == -9999,
                'Liquid Precipitation Depth Dimension - Six Hour Duration'] = None
weatherData.head()

Unnamed: 0,Year,Month,Day,Hour,Air Temp,Dew Point Temp,Sea Level Pressure,Wind Direction,Wind Speed Rate,Sky Condition Total Coverage Code,Liquid Precipitation Depth Dimension - 1Hr Duration,Liquid Precipitation Depth Dimension - Six Hour Duration,Air Temp F
0,2001,1,1,0,-0.6,-94,10146,280,57,2,0,,30.92
1,2001,1,1,1,-1.1,-94,10153,280,57,4,0,,30.02
2,2001,1,1,2,-1.7,-106,10161,290,62,2,0,,28.94
3,2001,1,1,3,-2.8,-100,10169,260,57,0,0,,26.96
4,2001,1,1,4,-2.8,-100,10177,260,52,0,0,,26.96
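Another way to swap a sentinel value for a missing value is the `.replace` method, sketched below on a made-up DataFrame (`toy`, `cleaned`, and the shortened column names are assumptions for illustration). Because `.replace` returns a new float column rather than writing into the existing integer column, it sidesteps the dtype warning that newer pandas versions may emit when a missing value is assigned into an integer column with `.loc`.

```python
import numpy as np
import pandas as pd

# Made-up data with shortened, hypothetical column names.
toy = pd.DataFrame({
    "Precip 1Hr": [0, -9999, 5],
    "Precip 6Hr": [-9999, -9999, 12],
})

# Replace the sentinel in one column; the result is a float column with NaN.
toy["Precip 6Hr"] = toy["Precip 6Hr"].replace(-9999, np.nan)
print(toy)

# Replace the sentinel everywhere, returning a new DataFrame.
cleaned = toy.replace(-9999, np.nan)
print(cleaned)
```

Replacing across the whole DataFrame is only appropriate when -9999 means "missing" in every column, so check the data documentation before doing that on real data.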


### Run `.info()` again to now see how many non-null values there are in 'Liquid Precipitation Depth Dimension - Six Hour Duration'

In [27]:
weatherData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8758 entries, 0 to 8757
Data columns (total 13 columns):
Year                                                        8758 non-null int64
Month                                                       8758 non-null int64
Day                                                         8758 non-null int64
Hour                                                        8758 non-null int64
Air Temp                                                    8758 non-null float64
Dew Point Temp                                              8758 non-null int64
Sea Level Pressure                                          8758 non-null int64
Wind Direction                                              8758 non-null int64
Wind Speed Rate                                             8758 non-null int64
Sky Condition Total Coverage Code                           8758 non-null int64
Liquid Precipitation Depth Dimension - 1Hr Duration         8758 non-null int64
Liquid Prec
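If you only want the null counts rather than the full `.info()` report, `isna().sum()` and `count()` give them directly. This is a small sketch on made-up data (`toy` and the shortened column name are hypothetical).

```python
import numpy as np
import pandas as pd

# Made-up data: one complete column and one with missing values.
toy = pd.DataFrame({
    "Air Temp": [-0.6, -1.1, -1.7],
    "Precip 6Hr": [np.nan, np.nan, 12.0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of non-null values in each column (what .info() reports).
non_null_per_column = toy.count()

print(missing_per_column)
print(non_null_per_column)
```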

# Lesson Summary:
In this lesson you learned:
* How to create new columns in a DataFrame.
* How to drop columns from a DataFrame.
* How to update the values of an entire column.
* How to update specific values in a DataFrame.

## Questions or Comments About This Notebook?
Feel free to contact me via my LinkedIn: https://www.linkedin.com/in/william-j-henry <br>
You can also email me at will@henryanalytics.com <br>