# Modifying DataFrames

[Cheat Sheet](https://www.codecademy.com/learn/paths/data-science/tracks/data-processing-pandas/modules/dspath-intro-pandas/cheatsheet)


In the previous lesson, you learned what a DataFrame is and how to select subsets of data from one.

In this lesson, you’ll learn how to modify an existing DataFrame. Some of the skills you’ll learn include:

* Adding columns to a DataFrame
* Using lambda functions to calculate complex quantities
* Renaming columns



## Adding a Column I

Sometimes, we want to add a column to an existing DataFrame. We might want to add new information or perform a calculation based on the data that we already have.

One way that we can add a new column is by giving a list of the same length as the existing DataFrame.

Suppose we own a hardware store called The Handy Woman and have a DataFrame containing inventory information:

```
Product ID 	Product Description 	Cost to Manufacture 	Price
1 	3 inch screw 	0.50 	0.75
2 	2 inch nail 	0.10 	0.25
3 	hammer 	3.00 	5.50
4 	screwdriver 	2.50 	3.00
```

It looks like the actual quantity of each product in our warehouse is missing!

Let’s use the following code to add that information to our DataFrame.

**`df['Quantity'] = [100, 150, 50, 35]`**

Our new DataFrame looks like this:
```
Product ID 	Product Description 	Cost to Manufacture 	Price 	Quantity
1 	3 inch screw 	0.50 	0.75 	100
2 	2 inch nail 	0.10 	0.25 	150
3 	hammer 	3.00 	5.50 	50
4 	screwdriver 	2.50 	3.00 	35
```
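The list must contain exactly one value per row. As a minimal sketch (the `Reorder Level` column and its three-item list are hypothetical, chosen only to trigger the failure), pandas raises a `ValueError` when the lengths differ:

```python
import pandas as pd

df = pd.DataFrame([
    [1, '3 inch screw', 0.50, 0.75],
    [2, '2 inch nail', 0.10, 0.25],
    [3, 'hammer', 3.00, 5.50],
    [4, 'screwdriver', 2.50, 3.00]
], columns=['Product ID', 'Product Description', 'Cost to Manufacture', 'Price'])

# Four rows and four values: the assignment succeeds.
df['Quantity'] = [100, 150, 50, 35]

# Four rows but only three values: pandas cannot tell which row is missing.
length_error = None
try:
    df['Reorder Level'] = [20, 30, 10]  # hypothetical column, wrong length
except ValueError as err:
    length_error = err

print(length_error)
```

Catching the error here is only for illustration; in practice the fix is to supply a list with one entry per row.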




In [9]:
import pandas as pd

df = pd.DataFrame([
  [1, '3 inch screw', 0.5, 0.75],
  [2, '2 inch nail', 0.10, 0.25],
  [3, 'hammer', 3.00, 5.50],
  [4, 'screwdriver', 2.50, 3.00]
],
  columns=['Product ID', 'Description', 'Cost to Manufacture', 'Price']
)

# The DataFrame df contains information on products sold at a hardware store.
# Add a column to df called 'Sold in Bulk?'.

df['Sold in Bulk?'] = ['Yes', 'Yes', 'No', 'No']

df

Unnamed: 0,Product ID,Description,Cost to Manufacture,Price,Sold in Bulk?
0,1,3 inch screw,0.5,0.75,Yes
1,2,2 inch nail,0.1,0.25,Yes
2,3,hammer,3.0,5.5,No
3,4,screwdriver,2.5,3.0,No


## Adding a Column II

We can also add **a new column that is the same for all rows** in the DataFrame. Let’s return to our inventory example:

```
Product ID 	Product Description 	Cost to Manufacture 	Price
1 	3 inch screw 	0.50 	0.75
2 	2 inch nail 	0.10 	0.25
3 	hammer 	3.00 	5.50
4 	screwdriver 	2.50 	3.00
```

Suppose we know that **all of our products are currently in-stock**. We can add a column that says this:

`df['In Stock?'] = True`

Now every row has the value `True` in a column called `In Stock?`.


```
Product ID 	Product Description 	Cost to Manufacture 	Price 	In Stock?
1 	3 inch screw 	0.50 	0.75 	True
2 	2 inch nail 	0.10 	0.25 	True
3 	hammer 	3.00 	5.50 	True
4 	screwdriver 	2.50 	3.00 	True
```




In [14]:
import pandas as pd

df = pd.DataFrame([
  [1, '3 inch screw', 0.5, 0.75],
  [2, '2 inch nail', 0.10, 0.25],
  [3, 'hammer', 3.00, 5.50],
  [4, 'screwdriver', 2.50, 3.00]
],
  columns=['Product ID', 'Description', 'Cost to Manufacture', 'Price']
)

# Add columns here

df["Is taxed?"] = True

df

Unnamed: 0,Product ID,Description,Cost to Manufacture,Price,Is taxed?
0,1,3 inch screw,0.5,0.75,True
1,2,2 inch nail,0.1,0.25,True
2,3,hammer,3.0,5.5,True
3,4,screwdriver,2.5,3.0,True


In [15]:
import pandas as pd

df = pd.DataFrame([
  [1, '3 inch screw', 0.5, 0.75],
  [2, '2 inch nail', 0.10, 0.25],
  [3, 'hammer', 3.00, 5.50],
  [4, 'screwdriver', 2.50, 3.00]
],
  columns=['Product ID', 'Description', 'Cost to Manufacture', 'Price']
)
# Add the same value, 'Yes', to all rows
df['Is taxed?'] = 'Yes'

df

Unnamed: 0,Product ID,Description,Cost to Manufacture,Price,Is taxed?
0,1,3 inch screw,0.5,0.75,Yes
1,2,2 inch nail,0.1,0.25,Yes
2,3,hammer,3.0,5.5,Yes
3,4,screwdriver,2.5,3.0,Yes


## Adding a Column III

Finally, you can add a new column by performing a function on the existing columns.

Maybe we want to add a column to our inventory table with the amount of sales tax that we need to charge for each item. The following code multiplies each Price by 0.075, the sales tax rate (7.5%) for our state:

`df['Sales Tax'] = df.Price * 0.075`

Now our table has a column called Sales Tax:
```
Product ID 	Product Description 	Cost to Manufacture 	Price 	Sales Tax
1 	3 inch screw 	0.50 	0.75 	0.06
2 	2 inch nail 	0.10 	0.25 	0.02
3 	hammer 	3.00 	5.50 	0.41
4 	screwdriver 	2.50 	3.00 	0.22
```
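The table above shows the tax rounded to two decimals, while the raw product carries more digits. The sketch below recomputes the column and, as an assumption about how the display was produced, stores a rounded copy in a hypothetical `Sales Tax (rounded)` column:

```python
import pandas as pd

df = pd.DataFrame([
    [1, '3 inch screw', 0.50, 0.75],
    [2, '2 inch nail', 0.10, 0.25],
    [3, 'hammer', 3.00, 5.50],
    [4, 'screwdriver', 2.50, 3.00]
], columns=['Product ID', 'Product Description', 'Cost to Manufacture', 'Price'])

# df.Price and df['Price'] refer to the same column.
df['Sales Tax'] = df['Price'] * 0.075

# Hypothetical extra column: the same tax rounded to two decimals for display.
df['Sales Tax (rounded)'] = df['Sales Tax'].round(2)

print(df[['Price', 'Sales Tax', 'Sales Tax (rounded)']])
```

Keeping the unrounded column around avoids compounding rounding error if the tax is used in later calculations.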



In [17]:
import pandas as pd

df = pd.DataFrame([
  [1, '3 inch screw', 0.5, 0.75],
  [2, '2 inch nail', 0.10, 0.25],
  [3, 'hammer', 3.00, 5.50],
  [4, 'screwdriver', 2.50, 3.00]
],
  columns=['Product ID', 'Description', 'Cost to Manufacture', 'Price']
)

# Add a column to df called 'Margin', which is equal to the difference between 
# the Price and the Cost to Manufacture.

df["Margin"] = df.Price - df["Cost to Manufacture"]

df



Unnamed: 0,Product ID,Description,Cost to Manufacture,Price,Margin
0,1,3 inch screw,0.5,0.75,0.25
1,2,2 inch nail,0.1,0.25,0.15
2,3,hammer,3.0,5.5,2.5
3,4,screwdriver,2.5,3.0,0.5


## Performing Column Operations

In the previous exercise, we learned how to add columns to a DataFrame.

Often, the column that we want to add is related to existing columns, but requires a calculation more complex than multiplication or addition.

For example, imagine that we have the following table of customers.
```
Name 	Email
JOHN SMITH 	john.smith@gmail.com
Jane Doe 	jdoe@yahoo.com
joe schmo 	joeschmo@hotmail.com
```

It’s a little annoying that the capitalization is different for each row. Perhaps we’d like to make it more consistent by making all of the letters uppercase.

We can use the `apply` function to apply a function to every value in a particular column. For example, this code overwrites the existing `'Name'` column by applying the string method `str.upper` to every row in `'Name'`. (In Python 2 this could be written with `from string import upper`, but that function no longer exists in Python 3.)
```
df['Name'] = df.Name.apply(str.upper)
```
The result:
```
Name 	Email
JOHN SMITH 	john.smith@gmail.com
JANE DOE 	jdoe@yahoo.com
JOE SCHMO 	joeschmo@hotmail.com
```
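For simple string operations like this one, pandas also provides the `.str` accessor, which does the same job without `apply`. The sketch below is a side-by-side comparison rather than part of the original lesson; the variable names `applied` and `vectorized` are illustrative:

```python
import pandas as pd

df = pd.DataFrame([
    ['JOHN SMITH', 'john.smith@gmail.com'],
    ['Jane Doe', 'jdoe@yahoo.com'],
    ['joe schmo', 'joeschmo@hotmail.com']
], columns=['Name', 'Email'])

# Option 1: apply Python's built-in str.upper to each value.
applied = df.Name.apply(str.upper)

# Option 2: use the string accessor that pandas provides on Series.
vectorized = df.Name.str.upper()

# Both approaches produce the same Series.
print(applied.equals(vectorized))
print(list(applied))
```

`apply` remains the more general tool, since it accepts any function, including the lambda functions covered next.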



In [0]:
import pandas as pd

df = pd.DataFrame([
  ['JOHN SMITH', 'john.smith@gmail.com'],
  ['Jane Doe', 'jdoe@yahoo.com'],
  ['joe schmo', 'joeschmo@hotmail.com']
],
columns=['Name', 'Email'])


# Apply a lowercase function to all names in column 'Name' in df. Assign these
# new names to a new column of df called 'Lowercase Name'.
# str.lower is the built-in string method; `from string import lower`
# only works in Python 2.
df['Lowercase Name'] = df.Name.apply(str.lower)


## Reviewing Lambda Function

A lambda function is a way of defining a function in a single line of code. We often assign it to a variable so that we can call it by name.

For example, the following lambda function multiplies a number by 2 and then adds 3:
```
mylambda = lambda x: (x * 2) + 3
print(mylambda(5))
```
The output:
```
> 13
```
Lambda functions work with all types of variables, not just integers! Here is an example that takes in a string, assigns it to the temporary variable x, and then converts it into lowercase:
```
stringlambda = lambda x: x.lower()
print(stringlambda("Oh Hi Mark!"))
```
The output:
```
> oh hi mark!
```
Learn more about lambda functions in [this article!](https://www.codecademy.com/articles/lambda-functions)


Create a lambda function mylambda that returns the first and last letters of a string, assuming the string is at least 2 characters long. For example,

`print(mylambda('This is a string'))`

should produce:

`'Tg'`

In [26]:
# Use s rather than str as the parameter name to avoid shadowing the built-in str type
mylambda = lambda s: s[0] + s[-1]
mylambda("Love Letter")

'Lr'

## Reviewing Lambda Function: If Statements

We can make our lambdas more complex by using a modified form of an if statement.

Suppose we want to pay workers time-and-a-half for overtime (any work above 40 hours per week). The following function will convert the number of hours into time-and-a-half hours using an if statement:
```
def myfunction(x):
    if x > 40:
        return 40 + (x - 40) * 1.50
    else:
        return x
```
Below is a lambda function that does the same thing:
```
myfunction = lambda x: 40 + (x - 40) * 1.50 if x > 40 else x
```
In general, the syntax for an if statement in a lambda function is:
```
lambda x: [OUTCOME IF TRUE] if [CONDITIONAL] else [OUTCOME IF FALSE]
```
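As a quick check that the two forms agree, the sketch below evaluates both on a few sample values. The names `overtime_def` and `overtime_lambda` are illustrative, used here so that both versions can exist side by side:

```python
def overtime_def(x):
    # Hours above 40 count as time-and-a-half.
    if x > 40:
        return 40 + (x - 40) * 1.50
    else:
        return x

# The same logic as a conditional expression inside a lambda.
overtime_lambda = lambda x: 40 + (x - 40) * 1.50 if x > 40 else x

for hours in [30, 40, 45]:
    print(hours, overtime_def(hours), overtime_lambda(hours))
```

For 45 hours, both return 40 + 5 * 1.50 = 47.5; at or below 40 hours, the input comes back unchanged.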


In [33]:
# You are managing the webpage of a somewhat violent video game and you want to 
# check that each user’s age is 13 or greater when they visit the site.

mylambda = lambda age: "Welcome to BattleCity!" if age >= 13 else "You must be 13 or older"

print(mylambda(13))
print(mylambda(10))


Welcome to BattleCity!
You must be 13 or older


## Applying a Lambda to a Column

In Pandas, we often use lambda functions to perform complex operations on columns. For example, suppose that we want to create a column containing the email provider for each email address in the following table:
```
Name 	Email
JOHN SMITH 	john.smith@gmail.com
Jane Doe 	jdoe@yahoo.com
joe schmo 	joeschmo@hotmail.com
```
We could use the following code with a lambda function and the string method `.split()`:
```
df['Email Provider'] = df.Email.apply(
    lambda x: x.split('@')[-1]
    )
```
The result would be:

```
Name 	Email 	Email Provider
JOHN SMITH 	john.smith@gmail.com 	gmail.com
Jane Doe 	jdoe@yahoo.com 	yahoo.com
joe schmo 	joeschmo@hotmail.com 	hotmail.com
```
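Put together as a self-contained sketch, with the customer table rebuilt by hand from the rows shown above:

```python
import pandas as pd

df = pd.DataFrame([
    ['JOHN SMITH', 'john.smith@gmail.com'],
    ['Jane Doe', 'jdoe@yahoo.com'],
    ['joe schmo', 'joeschmo@hotmail.com']
], columns=['Name', 'Email'])

# split('@') breaks 'jdoe@yahoo.com' into ['jdoe', 'yahoo.com'];
# [-1] keeps the last piece, which is the provider.
df['Email Provider'] = df.Email.apply(lambda x: x.split('@')[-1])

print(df)
```

Using `[-1]` rather than `[1]` means the lambda still returns the final piece if an address happens to contain more than one `@`.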



In [35]:
import pandas as pd

df = pd.read_csv('employees.csv')

# Add columns here
# Create a lambda function get_last_name which takes a string with someone’s 
# first and last name (i.e., John Smith), and returns just the last name 
# (i.e., Smith).

get_last_name = lambda x: x.split()[-1]

# Use the lambda function get_last_name to create a new column
#  last_name with only the employees’ last name.

df['last_name'] = df.name.apply(get_last_name)

df


Unnamed: 0,id,name,hourly_wage,hours_worked,last_name
0,10310,Lauren Durham,19,43,Durham
1,18656,Grace Sellers,17,40,Sellers
2,61254,Shirley Rasmussen,16,30,Rasmussen
3,16886,Brian Rojas,18,47,Rojas
4,89010,Samantha Mosley,11,38,Mosley
5,87246,Louis Guzman,14,39,Guzman
6,20578,Denise Mcclure,15,40,Mcclure
7,12869,James Raymond,15,32,Raymond
8,53461,Noah Collier,18,35,Collier
9,14746,Donna Frederick,20,41,Frederick


## Applying a Lambda to a Row

We can also operate on multiple columns at once. If we use `apply` without specifying a single column and add the argument `axis=1`, the input to our lambda function will be an entire row, not a column. To access particular values of the row, we use the syntax `row.column_name` or `row['column_name']`.

Suppose we have a table representing a grocery list:
```
Item 	       Price 	   Is taxed?
Apple        	1.00    	No
Milk 	        4.20    	No
Paper Towels 	5.00    	Yes
Light Bulbs 	 3.75 	   Yes
```
If we want to add in the price with tax for each line, we’ll need to look at two columns: `Price` and `Is taxed?`.

If `Is taxed?` is `Yes`, then we’ll want to multiply `Price` by 1.075 (for 7.5% sales tax).

If `Is taxed?` is `No`, we’ll just have `Price` without multiplying it.

We can create this column using a lambda function and the keyword argument `axis=1`:
```
df['Price with Tax'] = df.apply(lambda row:
     row['Price'] * 1.075
     if row['Is taxed?'] == 'Yes'
     else row['Price'],
     axis=1
)
```
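Here is the grocery example as a runnable sketch, with the table rebuilt by hand from the rows above. The lambda is pulled out into a variable, `price_with_tax`, a name chosen here for readability:

```python
import pandas as pd

df = pd.DataFrame([
    ['Apple', 1.00, 'No'],
    ['Milk', 4.20, 'No'],
    ['Paper Towels', 5.00, 'Yes'],
    ['Light Bulbs', 3.75, 'Yes']
], columns=['Item', 'Price', 'Is taxed?'])

# With axis=1, each call receives one row, so the lambda can read
# both 'Price' and 'Is taxed?' for that row.
price_with_tax = lambda row: (
    row['Price'] * 1.075
    if row['Is taxed?'] == 'Yes'
    else row['Price']
)

df['Price with Tax'] = df.apply(price_with_tax, axis=1)

print(df)
```

Bracket notation is required for `Is taxed?` because the name contains a space and a question mark, so `row.column_name` is not an option for that column.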



In [40]:
import pandas as pd

df = pd.read_csv('employees.csv')

print(df)

total_earned = lambda row: (
    (row.hourly_wage * 40) + (row.hourly_wage * 1.5) * (row.hours_worked - 40)
    if row.hours_worked > 40
    else row.hourly_wage * row.hours_worked
)

df['total_earned'] = df.apply(total_earned, axis = 1)

df

       id               name  hourly_wage  hours_worked
0   10310      Lauren Durham           19            43
1   18656      Grace Sellers           17            40
2   61254  Shirley Rasmussen           16            30
3   16886        Brian Rojas           18            47
4   89010    Samantha Mosley           11            38
5   87246       Louis Guzman           14            39
6   20578     Denise Mcclure           15            40
7   12869      James Raymond           15            32
8   53461       Noah Collier           18            35
9   14746    Donna Frederick           20            41
10  71127       Shirley Beck           14            32
11  92522    Christina Kelly            8            44
12  22447        Brian Noble           11            39
13  61654          Randy Key           16            38
14  16988      Diana Stewart           14            48
15  68619       Timothy Sosa           14            42
16  59949      Betty Skinner           11       

Unnamed: 0,id,name,hourly_wage,hours_worked,total_earned
0,10310,Lauren Durham,19,43,845.5
1,18656,Grace Sellers,17,40,680.0
2,61254,Shirley Rasmussen,16,30,480.0
3,16886,Brian Rojas,18,47,909.0
4,89010,Samantha Mosley,11,38,418.0
5,87246,Louis Guzman,14,39,546.0
6,20578,Denise Mcclure,15,40,600.0
7,12869,James Raymond,15,32,480.0
8,53461,Noah Collier,18,35,630.0
9,14746,Donna Frederick,20,41,830.0


## Renaming Columns

When we get our data from other sources, we often want to change the column names. For example, we might want all of the column names to follow variable name rules, so that we can use **`df.column_name`** (which tab-completes) rather than **`df['column_name']`** (which takes up extra space).

You can change all of the column names at once by setting the `.columns` property to a different list. This is convenient when every column needs a new name, but be careful! The new names are assigned by position, so you can easily mislabel columns if you get the ordering wrong. Here’s an example:
```
df = pd.DataFrame({
    'name': ['John', 'Jane', 'Sue', 'Fred'],
    'age': [23, 29, 21, 18]
})
df.columns = ['First Name', 'Age']
```
This command edits the existing DataFrame df.
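To see the ordering risk concretely, the sketch below deliberately assigns the names in the wrong order, then tries a list of the wrong length. Both inputs are hypothetical mistakes made for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Sue', 'Fred'],
    'age': [23, 29, 21, 18]
})

# Wrong order: pandas accepts it silently, so 'Age' now holds the names.
df.columns = ['Age', 'First Name']
print(df.head(2))

# Wrong length: pandas raises a ValueError instead of guessing.
columns_error = None
try:
    df.columns = ['First Name']
except ValueError as err:
    columns_error = err

print(columns_error)
```

Only the length mistake is caught automatically; a wrong ordering has to be caught by inspecting the result.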


In [4]:
import pandas as pd

df = pd.read_csv('imdb.csv')

# Rename columns here
df.columns = ['ID', 'Title', 'Category', 'Year Released', 'Rating']
df.head()

Unnamed: 0,ID,Title,Category,Year Released,Rating
0,1,Avatar,action,2009,7.9
1,2,Jurassic World,action,2015,7.3
2,3,The Avengers,action,2012,8.1
3,4,The Dark Knight,action,2008,9.0
4,5,Star Wars: Episode I - The Phantom Menace,action,1999,6.6


## Renaming Columns II

You can also rename individual columns by using the `.rename` method. Pass a dictionary like the one below to the `columns` keyword argument:
```
{'old_column_name1': 'new_column_name1', 'old_column_name2': 'new_column_name2'}
```
Here’s an example:
```
df = pd.DataFrame({
    'name': ['John', 'Jane', 'Sue', 'Fred'],
    'age': [23, 29, 21, 18]
})
df.rename(columns={
    'name': 'First Name',
    'age': 'Age'},
    inplace=True)
```
The code above will rename name to First Name and age to Age.

Using rename with only the columns keyword will create a new DataFrame, leaving your original DataFrame unchanged. That’s why we also passed in the keyword argument inplace=True. Using inplace=True lets us edit the original DataFrame.

There are several reasons why `.rename` is preferable to `.columns`:

* You can rename just one column
* You can be specific about which column names are getting changed (with `.columns` you can accidentally switch column names if you’re not careful)

**Note: If you misspell one of the original column names, this command won’t fail. It just won’t change anything.**
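The sketch below illustrates three behaviours of `.rename`: returning a new DataFrame when `inplace` is omitted, silently ignoring a misspelled key, and raising an error when `errors='raise'` is passed. The misspelling `'nmae'` is a made-up typo, and `errors='raise'` is assumed to be available, which holds for reasonably recent pandas versions:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Sue', 'Fred'],
    'age': [23, 29, 21, 18]
})

# Without inplace=True, rename returns a new DataFrame and leaves df alone.
renamed = df.rename(columns={'name': 'First Name'})
print(list(df.columns), list(renamed.columns))

# A misspelled key is ignored by default: nothing is renamed.
typo = df.rename(columns={'nmae': 'First Name'})
print(list(typo.columns))

# Passing errors='raise' turns the silent no-op into a KeyError.
rename_error = None
try:
    df.rename(columns={'nmae': 'First Name'}, errors='raise')
except KeyError as err:
    rename_error = err

print(rename_error)
```

Assigning the result back (`df = df.rename(...)`) is an alternative to `inplace=True` that has the same effect on the variable `df`.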

In [5]:
import pandas as pd

df = pd.read_csv('imdb.csv')

df.rename(columns = {'name': 'movie_title'}, 
# so we edit the original dataframe 
inplace = True)

df.head()

Unnamed: 0,id,movie_title,genre,year,imdb_rating
0,1,Avatar,action,2009,7.9
1,2,Jurassic World,action,2015,7.3
2,3,The Avengers,action,2012,8.1
3,4,The Dark Knight,action,2008,9.0
4,5,Star Wars: Episode I - The Phantom Menace,action,1999,6.6


## Review

Great job! In this lesson, you learned how to modify an existing DataFrame. Some of the skills you’ve learned include:

* Adding columns to a DataFrame
* Using lambda functions to calculate complex quantities
* Renaming columns

Let’s practice what you just learned!


Once more, you’ll be the data analyst for ShoeFly.com, a fictional online shoe store.

 


In [6]:
import pandas as pd

orders = pd.read_csv('shoefly.csv')

# More messy order data has been loaded into the variable orders. 
# Examine the first 5 rows of the data using print and head.

orders.head()


Unnamed: 0,id,first_name,last_name,gender,email,shoe_type,shoe_material,shoe_color
0,54791,Rebecca,Lindsay,female,RebeccaLindsay57@hotmail.com,clogs,faux-leather,black
1,53450,Emily,Joyce,female,EmilyJoyce25@gmail.com,ballet flats,faux-leather,navy
2,91987,Joyce,Waller,female,Joyce.Waller@gmail.com,sandles,fabric,black
3,14437,Justin,Erickson,male,Justin.Erickson@outlook.com,clogs,faux-leather,red
4,79357,Andrew,Banks,male,AB4318@gmail.com,boots,leather,brown


In [7]:
# Many of our customers want to buy vegan shoes (shoes made from materials that 
# do not come from animals). Add a new column called shoe_source, which is vegan 
# if the material is not leather and animal otherwise.


shoe_source = lambda row: "vegan" if row.shoe_material != "leather" else "animal"
orders['shoe_source'] = orders.apply(shoe_source, axis = 1)

orders.head()

Unnamed: 0,id,first_name,last_name,gender,email,shoe_type,shoe_material,shoe_color,shoe_source
0,54791,Rebecca,Lindsay,female,RebeccaLindsay57@hotmail.com,clogs,faux-leather,black,vegan
1,53450,Emily,Joyce,female,EmilyJoyce25@gmail.com,ballet flats,faux-leather,navy,vegan
2,91987,Joyce,Waller,female,Joyce.Waller@gmail.com,sandles,fabric,black,vegan
3,14437,Justin,Erickson,male,Justin.Erickson@outlook.com,clogs,faux-leather,red,vegan
4,79357,Andrew,Banks,male,AB4318@gmail.com,boots,leather,brown,animal


Our marketing department wants to send out an email to each customer. Using the columns `last_name` and `gender`, create a column called `salutation` which contains `Dear Mr. <last_name>` for men and `Dear Ms. <last_name>` for women.

In [8]:

salutation = lambda row: \
  'Dear Mr. {}'.format(row.last_name) \
  if row.gender == 'male' \
  else 'Dear Ms. {}'.format(row.last_name) 


orders["salutation"] = orders.apply(salutation, axis = 1)

orders.head()


Unnamed: 0,id,first_name,last_name,gender,email,shoe_type,shoe_material,shoe_color,shoe_source,salutation
0,54791,Rebecca,Lindsay,female,RebeccaLindsay57@hotmail.com,clogs,faux-leather,black,vegan,Dear Ms. Lindsay
1,53450,Emily,Joyce,female,EmilyJoyce25@gmail.com,ballet flats,faux-leather,navy,vegan,Dear Ms. Joyce
2,91987,Joyce,Waller,female,Joyce.Waller@gmail.com,sandles,fabric,black,vegan,Dear Ms. Waller
3,14437,Justin,Erickson,male,Justin.Erickson@outlook.com,clogs,faux-leather,red,vegan,Dear Mr. Erickson
4,79357,Andrew,Banks,male,AB4318@gmail.com,boots,leather,brown,animal,Dear Mr. Banks


## Test

![text](https://i.imgur.com/lz2bpeC.png)
![alt text](https://i.imgur.com/V6FUUUI.png)
![alt text](https://i.imgur.com/KByX7wq.png)
![alt text](https://i.imgur.com/Y5V698f.png)
![alt text](https://i.imgur.com/bNJbc7a.png)
![alt text](https://i.imgur.com/CVB9WRd.png)
