# CREATING, LOADING, AND SELECTING DATA WITH PANDAS

## Importing the Pandas Module

> from Codecademy

Pandas is a Python module for working with tabular data (i.e., data in a table with rows and columns). Pandas has a lot of the same functionality as SQL or Excel, but adds the power of Python.

In order to get access to the Pandas module, we’ll need to install the module and then import it into a Python file. 

The pandas module is usually imported at the top of a Python file under the alias `pd`.

```python
import pandas as pd
```

If we need to access the `pandas` module, we can do so by operating on `pd`.

In this lesson, you’ll learn the basics of working with a single table in Pandas, such as:

- Creating a table from scratch
- Loading data from another file
- Selecting certain rows or columns of a table


## Create a DataFrame I

A DataFrame is an object that stores data as rows and columns. You can think of a DataFrame as a spreadsheet or as a SQL table. You can manually create a DataFrame or fill it with data from a CSV, an Excel spreadsheet, or a SQL query.

DataFrames have rows and columns. Each column has a name, which is usually a string. Each row has an index, which is an integer by default. DataFrames can contain many different data types: strings, ints, floats, tuples, etc.

You can pass in a dictionary to `pd.DataFrame()`. Each key is a column name and each value is a list of column values. The columns must all be the same length or you will get an error. Here’s an example:

```python
df1 = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
    'address': ['123 Main St.', '456 Maple Ave.', '789 Broadway'],
    'age': [34, 28, 51]
})
```
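As a quick check of the rules above, here is a minimal sketch that builds the same DataFrame, inspects its shape, and then tries a dictionary whose columns have different lengths. The variable name `mismatch_error` is only for illustration.

```python
import pandas as pd

# Each key becomes a column name; each list holds that column's values.
df1 = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
    'address': ['123 Main St.', '456 Maple Ave.', '789 Broadway'],
    'age': [34, 28, 51]
})
print(df1.shape)           # (number of rows, number of columns)
print(list(df1.columns))   # column names, in insertion order

# Columns of unequal length are rejected with a ValueError.
try:
    pd.DataFrame({'name': ['John Smith', 'Jane Doe'], 'age': [34]})
    mismatch_error = None
except ValueError as err:
    mismatch_error = err
print(type(mismatch_error).__name__)
```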

## Create a DataFrame II

You can also add data using lists.

For example, you can pass in a list of lists, where each one represents a row of data. Use the keyword argument `columns` to pass a list of column names.

```python
df2 = pd.DataFrame([
    ['John Smith', '123 Main St.', 34],
    ['Jane Doe', '456 Maple Ave.', 28],
    ['Joe Schmo', '789 Broadway', 51]
    ],
    columns=['name', 'address', 'age'])
```

## Comma Separated Values (CSV)

We now know how to create our own DataFrame. However, most of the time, we’ll be working with datasets that already exist. One of the most common formats for big datasets is the CSV.

CSV (comma separated values) is a text-only spreadsheet format. You can find CSVs in lots of places:

- Online datasets (for example, from data.gov)
- Export from Excel or Google Sheets
- Export from SQL

The first row of a CSV contains column headings. All subsequent rows contain values. Each column heading and each value is separated by a comma:

```
column1,column2,column3
value1,value2,value3
```

cupcakes.csv:

```
name,cake_flavor,frosting_flavor,topping
Chocolate Cake,chocolate,chocolate,chocolate shavings
Birthday Cake,vanilla,vanilla,rainbow sprinkles
Carrot Cake,carrot,cream cheese,almonds
```

## Loading and Saving CSVs

When you have data in a CSV, you can load it into a DataFrame in Pandas using `.read_csv()`:

```python
pd.read_csv('my-csv-file.csv')
```

In the example above, the `pd.read_csv()` function is called, and the name of the CSV file, `my-csv-file.csv`, is passed in as an argument. The function returns a DataFrame, which we would normally assign to a variable.

We can also save data to a CSV, using `.to_csv()`.

```python
df.to_csv('new-csv-file.csv')
```

In the example above, the `.to_csv()` method is called on `df` (which represents a DataFrame object). The name of the CSV file is passed in as an argument (`new-csv-file.csv`). By default, this method will save the CSV file in your current directory.
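Here is a small round-trip sketch that saves a DataFrame and loads it back. It writes into a temporary directory (an assumption made so the example cleans up after itself) and passes `index=False` so that the row index is not written as an extra column.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    'name': ['Chocolate Cake', 'Birthday Cake', 'Carrot Cake'],
    'cake_flavor': ['chocolate', 'vanilla', 'carrot'],
})

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'new-csv-file.csv')
    # index=False skips writing the row index as its own column.
    df.to_csv(path, index=False)
    # read_csv returns a new DataFrame, so we assign it to a variable.
    loaded = pd.read_csv(path)

print(loaded)
```

Without `index=False`, the saved file would contain the index as an unnamed first column, which then shows up as an extra column when the file is loaded again.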

## Inspect a DataFrame

When we load a new DataFrame from a CSV, we want to know what it looks like.

If it’s a small DataFrame, you can display it by typing `print(df)`.

If it’s a larger DataFrame, it’s helpful to be able to inspect a few items without having to look at the entire DataFrame.

The method `.head()` gives the first 5 rows of a DataFrame. If you want to see more rows, you can pass in the positional argument `n`. For example, `df.head(10)` would show the first 10 rows.

The method `df.info()` prints a summary of each column, such as its data type and its number of non-null values.
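A short sketch of these inspection methods on a made-up 100-row DataFrame (the column names `id` and `squared` are invented for this example):

```python
import pandas as pd

# A hypothetical DataFrame that is too long to print comfortably.
df = pd.DataFrame({
    'id': range(100),
    'squared': [i ** 2 for i in range(100)],
})

first_five = df.head()    # the default is the first 5 rows
first_ten = df.head(10)   # pass n to ask for a different number of rows
print(first_five)

# info() prints the column names, non-null counts, and data types.
df.info()
```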

## Select Columns

Now we know how to create and load data. Let’s select parts of those datasets that are interesting or important to our analyses.

Suppose you have the DataFrame called `customers`, which contains the ages of your customers:

```
name	age
Rebecca Erikson	35
Thomas Roberson	28
Diane Ochoa	42
…	…

```

Perhaps you want to take the average or plot a histogram of the ages. In order to do either of these tasks, you’d need to select the column.

There are two possible syntaxes for selecting all values from a column:

1. Select the column as if you were selecting a value from a dictionary using a key. In our example, we would type `customers['age']` to select the ages.
2. If the name of a column follows all of the rules for a variable name (doesn’t start with a number, doesn’t contain spaces or special characters, etc.), then you can select it using the following notation: `df.MySecondColumn`. In our example, we would type `customers.age`.

When we select a single column, the result is called a _Series_.
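The sketch below tries both syntaxes, using only the three customers shown in the table above, and then takes the average age.

```python
import pandas as pd

customers = pd.DataFrame({
    'name': ['Rebecca Erikson', 'Thomas Roberson', 'Diane Ochoa'],
    'age': [35, 28, 42],
})

ages_by_key = customers['age']   # dictionary-style selection
ages_by_attr = customers.age     # attribute-style selection

print(type(ages_by_key).__name__)   # a single column is a Series
print(ages_by_key.mean())           # average of the three ages
```

The bracket syntax works for any column name, so it is the safer choice when a name contains spaces or clashes with an existing DataFrame attribute.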

## Selecting Multiple Columns

When you have a larger DataFrame, you might want to select just a few columns.

For instance, let’s return to a DataFrame of orders from ShoeFly.com:

```
id	first_name	last_name	email	shoe_type	shoe_material	shoe_color
54791	Rebecca	Lindsay	RebeccaLindsay57@hotmail.com	clogs	faux-leather	black
53450	Emily	Joyce	EmilyJoyce25@gmail.com	ballet flats	faux-leather	navy
91987	Joyce	Waller	Joyce.Waller@gmail.com	sandals	fabric	black
14437	Justin	Erickson	Justin.Erickson@outlook.com	clogs	faux-leather	red

```

We might just be interested in the customer’s `last_name` and `email`. We want a DataFrame like this:

```
last_name	email
Lindsay	RebeccaLindsay57@hotmail.com
Joyce	EmilyJoyce25@gmail.com
Waller	Joyce.Waller@gmail.com
Erickson	Justin.Erickson@outlook.com

```

To select two or more columns from a DataFrame, we use a list of the column names. To create the DataFrame shown above, we would use:

```python
new_df = orders[['last_name', 'email']]
```
**Note:** Make sure that you have a double set of brackets (`[[]]`), or this command won’t work!
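To see why the double brackets matter, this sketch compares the result types on a two-row version of `orders`. The outer brackets do the selecting, and the inner brackets are an ordinary Python list of column names.

```python
import pandas as pd

orders = pd.DataFrame({
    'id': [54791, 53450],
    'last_name': ['Lindsay', 'Joyce'],
    'email': ['RebeccaLindsay57@hotmail.com', 'EmilyJoyce25@gmail.com'],
})

new_df = orders[['last_name', 'email']]   # list of names -> DataFrame
one_col_df = orders[['email']]            # one-item list -> still a DataFrame
one_col_series = orders['email']          # bare name -> Series

# Without the inner list, pandas looks for one column labeled
# ('last_name', 'email'), which does not exist here.
try:
    orders['last_name', 'email']
    raised_key_error = False
except KeyError:
    raised_key_error = True
print(raised_key_error)
```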

## Select Rows

Let’s revisit our `orders` from ShoeFly.com:

```
id	first_name	last_name	email	shoe_type	shoe_material	shoe_color
54791	Rebecca	Lindsay	RebeccaLindsay57@hotmail.com	clogs	faux-leather	black
53450	Emily	James	EmilyJames25@gmail.com	ballet flats	faux-leather	navy
91987	Joyce	Waller	Joyce.Waller@gmail.com	sandals	fabric	black
14437	Justin	Erickson	Justin.Erickson@outlook.com	clogs	faux-leather	red
…						

```

Maybe our Customer Service department has just received a message from Joyce Waller, so we want to know exactly what she ordered. We want to select this single row of data.

DataFrames are zero-indexed, meaning that we start with the 0th row and count up from there. Joyce Waller’s order is the 2nd row.

We select it using the following command:

```python
orders.iloc[2]
```

When we select a single row, the result is a Series (just like when we select a single column).
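A minimal sketch of this selection, using a few columns from the four rows shown above:

```python
import pandas as pd

orders = pd.DataFrame({
    'id': [54791, 53450, 91987, 14437],
    'first_name': ['Rebecca', 'Emily', 'Joyce', 'Justin'],
    'last_name': ['Lindsay', 'James', 'Waller', 'Erickson'],
    'shoe_type': ['clogs', 'ballet flats', 'sandals', 'clogs'],
})

# Position 2 is the third row, because counting starts at 0.
joyce_order = orders.iloc[2]
print(type(joyce_order).__name__)   # a single row is a Series
print(joyce_order['shoe_type'])     # the column names become the Series labels
```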

## Selecting Multiple Rows

You can also select multiple rows from a DataFrame.

Here are a few more rows from ShoeFly.com’s `orders` DataFrame:

```
id	first_name	last_name	email	shoe_type	shoe_material	shoe_color
54791	Rebecca	Lindsay	RebeccaLindsay57@hotmail.com	clogs	faux-leather	black
53450	Emily	Joyce	EmilyJoyce25@gmail.com	ballet flats	faux-leather	navy
91987	Joyce	Waller	Joyce.Waller@gmail.com	sandals	fabric	black
14437	Justin	Erickson	Justin.Erickson@outlook.com	clogs	faux-leather	red
79357	Andrew	Banks	AB4318@gmail.com	boots	leather	brown
52386	Julie	Marsh	JulieMarsh59@gmail.com	sandals	fabric	black
20487	Thomas	Jensen	TJ5470@gmail.com	clogs	fabric	navy
76971	Janice	Hicks	Janice.Hicks@gmail.com	clogs	faux-leather	navy
21586	Gabriel	Porter	GabrielPorter24@gmail.com	clogs	leather	brown
```

Here are some different ways of selecting multiple rows:

- `orders.iloc[3:7]` would select all rows starting at the 3rd row and up to but not including the 7th row (i.e., the 3rd row, 4th row, 5th row, and 6th row)

```
id	first_name	last_name	email	shoe_type	shoe_material	shoe_color
14437	Justin	Erickson	Justin.Erickson@outlook.com	clogs	faux-leather	red
79357	Andrew	Banks	AB4318@gmail.com	boots	leather	brown
52386	Julie	Marsh	JulieMarsh59@gmail.com	sandals	fabric	black
20487	Thomas	Jensen	TJ5470@gmail.com	clogs	fabric	navy
```

- `orders.iloc[:4]` would select all rows up to, but not including the 4th row (i.e., the 0th, 1st, 2nd, and 3rd rows)

```
id	first_name	last_name	email	shoe_type	shoe_material	shoe_color
54791	Rebecca	Lindsay	RebeccaLindsay57@hotmail.com	clogs	faux-leather	black
53450	Emily	Joyce	EmilyJoyce25@gmail.com	ballet flats	faux-leather	navy
91987	Joyce	Waller	Joyce.Waller@gmail.com	sandals	fabric	black
14437	Justin	Erickson	Justin.Erickson@outlook.com	clogs	faux-leather	red
```

- `orders.iloc[-3:]` would select the rows starting at the 3rd to last row and up to and including the final row

```
id	first_name	last_name	email	shoe_type	shoe_material	shoe_color
20487	Thomas	Jensen	TJ5470@gmail.com	clogs	fabric	navy
76971	Janice	Hicks	Janice.Hicks@gmail.com	clogs	faux-leather	navy
21586	Gabriel	Porter	GabrielPorter24@gmail.com	clogs	leather	brown
```
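The three slices above can be reproduced with a trimmed-down `orders` that keeps only the `id` and `last_name` columns:

```python
import pandas as pd

orders = pd.DataFrame({
    'id': [54791, 53450, 91987, 14437, 79357, 52386, 20487, 76971, 21586],
    'last_name': ['Lindsay', 'Joyce', 'Waller', 'Erickson', 'Banks',
                  'Marsh', 'Jensen', 'Hicks', 'Porter'],
})

middle = orders.iloc[3:7]       # positions 3, 4, 5, 6
first_four = orders.iloc[:4]    # positions 0, 1, 2, 3
last_three = orders.iloc[-3:]   # the final three rows

print(list(middle['last_name']))
print(list(first_four['last_name']))
print(list(last_three['last_name']))
```

Slicing follows the same rules as Python list slicing: the start position is included and the stop position is excluded.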

## Select Rows with Logic I

You can select a subset of a DataFrame by using logical statements:

```python
df[df.MyColumnName == desired_column_value]
```

We have a large DataFrame with information about our customers. A few of the many rows look like this:

```
name	address	phone	age
Martha Jones	123 Main St.	234-567-8910	28
Rose Tyler	456 Maple Ave.	212-867-5309	22
Donna Noble	789 Broadway	949-123-4567	35
Amy Pond	98 West End Ave.	646-555-1234	29
Clara Oswald	54 Columbus Ave.	714-225-1957	31
…	…	…	…

```

Suppose we want to select all rows where the customer’s age is 30. We would use:

```python
df[df.age == 30]
```

In Python, `==` is how we test if a value is exactly equal to another value.

We can use other logical statements, such as:

- Greater Than, `>` — Here, we select all rows where the customer’s age is greater than 30:
```python
df[df.age > 30]
```

- Less Than, `<` — Here, we select all rows where the customer’s age is less than 30:
```python
df[df.age < 30]
```

- Not Equal, `!=` — This snippet selects all rows where the customer’s name is not Clara Oswald:
```python
df[df.name != 'Clara Oswald']
```
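It can help to look at what the comparison itself produces. In this sketch, built from the five customers shown above, `df.age > 30` evaluates to a Series of `True`/`False` values, and `df[...]` keeps the rows where that value is `True`.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Martha Jones', 'Rose Tyler', 'Donna Noble',
             'Amy Pond', 'Clara Oswald'],
    'age': [28, 22, 35, 29, 31],
})

mask = df.age > 30          # one True/False value per row
over_30 = df[mask]          # keeps only the rows marked True
under_30 = df[df.age < 30]
exactly_30 = df[df.age == 30]              # empty for these five rows
not_clara = df[df['name'] != 'Clara Oswald']

print(list(mask))
print(list(over_30['name']))
print(len(exactly_30))
```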

## Select Rows with Logic II

You can also combine multiple logical statements, as long as each statement is in parentheses.

For instance, suppose we wanted to select all rows where the customer’s age was under 30 or the customer’s name was “Martha Jones”:

```
name	address	phone	age
Martha Jones	123 Main St.	234-567-8910	28
Rose Tyler	456 Maple Ave.	212-867-5309	22
Donna Noble	789 Broadway	949-123-4567	35
Amy Pond	98 West End Ave.	646-555-1234	29
Clara Oswald	54 Columbus Ave.	714-225-1957	31
…			

```
We could use the following code:

```python
df[(df.age < 30) |
   (df.name == 'Martha Jones')]
```

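In Pandas, `|` means “or” and `&` means “and” when combining conditions like these. Each condition needs its own parentheses because `|` and `&` bind more tightly than comparisons such as `<` and `==`.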
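The following sketch runs both operators on the five customers shown above, so the difference between “or” and “and” is visible in the results.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Martha Jones', 'Rose Tyler', 'Donna Noble',
             'Amy Pond', 'Clara Oswald'],
    'age': [28, 22, 35, 29, 31],
})

# "or": a row is kept if at least one condition is True.
either = df[(df.age < 30) | (df['name'] == 'Martha Jones')]

# "and": a row is kept only if both conditions are True.
both = df[(df.age < 30) & (df['name'] == 'Martha Jones')]

print(list(either['name']))
print(list(both['name']))
```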

## Select Rows with Logic III

Suppose we want to select the rows where the customer’s name is either “Martha Jones”, “Rose Tyler” or “Amy Pond”.

```
name	address	phone	age
Martha Jones	123 Main St.	234-567-8910	28
Rose Tyler	456 Maple Ave.	212-867-5309	22
Donna Noble	789 Broadway	949-123-4567	35
Amy Pond	98 West End Ave.	646-555-1234	29
Clara Oswald	54 Columbus Ave.	714-225-1957	31
…	…	…	…

```

We could use the `.isin()` method to check that `df.name` is one of a list of values:

```python
df[df.name.isin(['Martha Jones',
     'Rose Tyler',
     'Amy Pond'])]
```
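Here is the same selection as a runnable sketch. It also shows the `~` operator, which is not covered above: it inverts a True/False Series, so it selects everyone who is not in the list.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Martha Jones', 'Rose Tyler', 'Donna Noble',
             'Amy Pond', 'Clara Oswald'],
    'age': [28, 22, 35, 29, 31],
})

wanted = ['Martha Jones', 'Rose Tyler', 'Amy Pond']

in_list = df[df['name'].isin(wanted)]
# ~ flips each True/False value, giving the rows NOT in the list.
not_in_list = df[~df['name'].isin(wanted)]

print(list(in_list['name']))
print(list(not_in_list['name']))
```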

## Setting indices

When we select a subset of a DataFrame using logic, we end up with non-consecutive indices. This is inelegant and makes it hard to use `.iloc[]`.

We can fix this using the method `.reset_index()`. For example, here is a DataFrame called `df` with non-consecutive indices:

```
First Name	Last Name
0	John	Smith
4	Jane	Doe
7	Joe	Schmo

```

If we use the command `df.reset_index()`, we get a new DataFrame with a new set of indices:

```
index	First Name	Last Name
0	0	John	Smith
1	4	Jane	Doe
2	7	Joe	Schmo

```

Note that the old indices have been moved into a new column called `'index'`. Unless you need those values for something special, it’s probably better to use the keyword `drop=True` so that you don’t end up with that extra column. If we run the command `df.reset_index(drop=True)`, we get a new DataFrame that looks like this:

```
First Name	Last Name
0	John	Smith
1	Jane	Doe
2	Joe	Schmo

```

Using `.reset_index()` will return a new DataFrame, but we usually just want to modify our existing DataFrame. If we use the keyword `inplace=True` we can just modify our existing DataFrame.
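The three variants can be compared side by side. This sketch rebuilds the small DataFrame from above with the indices 0, 4, and 7.

```python
import pandas as pd

df = pd.DataFrame(
    {'First Name': ['John', 'Jane', 'Joe'],
     'Last Name': ['Smith', 'Doe', 'Schmo']},
    index=[0, 4, 7],   # non-consecutive, as if rows had been filtered out
)

with_old_index = df.reset_index()       # old indices kept in an 'index' column
clean = df.reset_index(drop=True)       # old indices discarded
print(list(with_old_index.columns))
print(list(df.index))                   # df itself is still unchanged here

# inplace=True modifies df directly and returns None.
df.reset_index(drop=True, inplace=True)
print(list(df.index))
```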

## Adding a Column I

Sometimes, we want to add a column to an existing DataFrame. We might want to add new information or perform a calculation based on the data that we already have.

One way that we can add a new column is by giving a list of the same length as the existing DataFrame.

Suppose we own a hardware store called The Handy Woman and have a DataFrame containing inventory information:

```
Product ID	Product Description	Cost to Manufacture	Price
1	3 inch screw	0.50	0.75
2	2 inch nail	0.10	0.25
3	hammer	3.00	5.50
4	screwdriver	2.50	3.00
```

It looks like the actual quantity of each product in our warehouse is missing!

Let’s use the following code to add that information to our DataFrame.

```python
df['Quantity'] = [100, 150, 50, 35]
```

Our new DataFrame looks like this:

```
Product ID	Product Description	Cost to Manufacture	Price	Quantity
1	3 inch screw	0.50	0.75	100
2	2 inch nail	0.10	0.25	150
3	hammer	3.00	5.50	50
4	screwdriver	2.50	3.00	35
```


## Adding a Column II

We can also add a new column that is the same for all rows in the DataFrame. Let’s return to our inventory example:

```
Product ID	Product Description	Cost to Manufacture	Price
1	3 inch screw	0.50	0.75
2	2 inch nail	0.10	0.25
3	hammer	3.00	5.50
4	screwdriver	2.50	3.00

```
Suppose we know that all of our products are currently in-stock. We can add a column that says this:

```python
df['In Stock?'] = True
```

Now all of the rows have a column called `In Stock?` with value `True`.

```
Product ID	Product Description	Cost to Manufacture	Price	In Stock?
1	3 inch screw	0.50	0.75	True
2	2 inch nail	0.10	0.25	True
3	hammer	3.00	5.50	True
4	screwdriver	2.50	3.00	True
```

## Adding a Column III

Finally, you can add a new column by performing a function on the existing columns.

Maybe we want to add a column to our inventory table with the amount of sales tax that we need to charge for each item. The following code multiplies each `Price` by `0.075`, the sales tax for our state:

```python
df['Sales Tax'] = df.Price * 0.075
```

Now our table has a column called `Sales Tax`:

```
Product ID	Product Description	Cost to Manufacture	Price	Sales Tax
1	3 inch screw	0.50	0.75	0.06
2	2 inch nail	0.10	0.25	0.02
3	hammer	3.00	5.50	0.41
4	screwdriver	2.50	3.00	0.22

```
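Putting the three approaches together, here is a sketch that builds the inventory table and adds each kind of column in turn. The computed values are not rounded here, so they will show more decimal places than the table above.

```python
import pandas as pd

df = pd.DataFrame({
    'Product ID': [1, 2, 3, 4],
    'Product Description': ['3 inch screw', '2 inch nail',
                            'hammer', 'screwdriver'],
    'Cost to Manufacture': [0.50, 0.10, 3.00, 2.50],
    'Price': [0.75, 0.25, 5.50, 3.00],
})

# 1. A list with one value per existing row.
df['Quantity'] = [100, 150, 50, 35]

# 2. A single value that is repeated for every row.
df['In Stock?'] = True

# 3. A calculation based on an existing column.
df['Sales Tax'] = df['Price'] * 0.075

print(df)
```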

## Performing Column Operations

In the previous exercise, we learned how to add columns to a DataFrame.

Often, the column that we want to add is related to existing columns, but requires a calculation more complex than multiplication or addition.

For example, imagine that we have the following table of customers.

```
Name	Email
JOHN SMITH	john.smith@gmail.com
Jane Doe	jdoe@yahoo.com
joe schmo	joeschmo@hotmail.com

```

It’s a little annoying that the capitalization is different for each row. Perhaps we’d like to make it more consistent by making all of the letters uppercase.

We can use the `apply` method to apply a function to every value in a particular column. For example, this code overwrites the existing `'Name'` column by applying the function `str.upper` to every value in `'Name'`.

```python
df['Name'] = df.Name.apply(str.upper)
```

The result:

```
Name	Email
JOHN SMITH	john.smith@gmail.com
JANE DOE	jdoe@yahoo.com
JOE SCHMO	joeschmo@hotmail.com

```
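The sketch below runs the same operation end to end. It also computes the result a second way, with the `.str.upper()` accessor, which is an alternative not covered above and is shown only for comparison.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['JOHN SMITH', 'Jane Doe', 'joe schmo'],
    'Email': ['john.smith@gmail.com', 'jdoe@yahoo.com',
              'joeschmo@hotmail.com'],
})

# apply calls str.upper once for each value in the column.
upper_with_apply = df['Name'].apply(str.upper)

# Alternative (not from the lesson): the .str accessor does the same job.
upper_with_accessor = df['Name'].str.upper()

df['Name'] = upper_with_apply
print(list(df['Name']))
```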

## Lambda Function

A _lambda function_ is a way of defining a function in a single line of code. Usually, we would assign them to a variable.

For example, the following lambda function multiplies a number by 2 and then adds 3:

```python
mylambda = lambda x: (x * 2) + 3
print(mylambda(5))
```

The output:

```
13
```

Let’s break this syntax down:

1. The function is stored in a variable called `mylambda`
2. `lambda` declares that this is a lambda function (if you are familiar with normal Python functions, this is similar to how we use `def` to declare a function)
3. `x` is what we call the input we are passing into `mylambda`
4. We are returning `(x * 2) + 3` (with normal Python functions, we use the keyword `return`)

Lambda functions only work when the body is a single expression. If we wanted to write something longer, we’d need to define a regular function. Lambda functions are great when you need to use a function once. Because you aren’t defining a named function, you lose the reusability that regular functions offer. By saving the work of defining a function, a lambda function lets us efficiently evaluate an expression for a specific task, such as defining a column in a table or populating information in a dictionary.

Lambda functions work with all types of variables, not just integers! Here is an example that takes in a string, assigns it to the temporary variable `x`, and then converts it into lowercase:

```python
stringlambda = lambda x: x.lower()
print(stringlambda("Oh Hi Mark!"))
```

The output:

```
oh hi mark!
```

We can make our lambdas more complex by using a modified form of an if statement.

Suppose we want to pay workers time-and-a-half for overtime (any work above 40 hours per week). The following function will convert the number of hours into time-and-a-half hours using an if statement:

```python
def myfunction(x):
    if x > 40:
        return 40 + (x - 40) * 1.50
    else:
        return x
```

Below is a lambda function that does the same thing:

```python
myfunction = lambda x: 40 + (x - 40) * 1.50 if x > 40 else x
```

In general, the syntax for an if statement in a lambda function is:

```
lambda x: [OUTCOME IF TRUE] if [CONDITIONAL] else [OUTCOME IF FALSE]
```
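To check that the two versions really agree, this sketch gives them different names (`overtime_def` and `overtime_lambda`, both chosen for this example) and calls each with a few values.

```python
def overtime_def(x):
    # Hours above 40 count as time-and-a-half.
    if x > 40:
        return 40 + (x - 40) * 1.50
    else:
        return x


# The same logic written as a conditional expression inside a lambda.
overtime_lambda = lambda x: 40 + (x - 40) * 1.50 if x > 40 else x

for hours in (35, 40, 50):
    print(hours, overtime_def(hours), overtime_lambda(hours))
```

For 50 hours, both return `40 + 10 * 1.50`, which is `55.0`. For 40 hours or fewer, the input comes back unchanged.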

## Applying a Lambda to a Column

In Pandas, we often use lambda functions to perform complex operations on columns. For example, suppose that we want to create a column containing the email provider for each email address in the following table:

```
Name	Email
JOHN SMITH	john.smith@gmail.com
Jane Doe	jdoe@yahoo.com
joe schmo	joeschmo@hotmail.com

```

We could use the following code with a lambda function and the string method `.split()`:

```python
df['Email Provider'] = df.Email.apply(
    lambda x: x.split('@')[-1]
    )
```

The result would be:

```
Name	Email	Email Provider
JOHN SMITH	john.smith@gmail.com	gmail.com
Jane Doe	jdoe@yahoo.com	yahoo.com
joe schmo	joeschmo@hotmail.com	hotmail.com
```
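As a self-contained sketch, the example above can be run like this. Splitting on `'@'` gives a list of pieces, and `[-1]` picks the last piece, which is the part after the `@`.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['JOHN SMITH', 'Jane Doe', 'joe schmo'],
    'Email': ['john.smith@gmail.com', 'jdoe@yahoo.com',
              'joeschmo@hotmail.com'],
})

# 'jdoe@yahoo.com'.split('@') gives ['jdoe', 'yahoo.com'];
# the index [-1] selects the last element of that list.
df['Email Provider'] = df['Email'].apply(lambda x: x.split('@')[-1])

print(list(df['Email Provider']))
```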


## Applying a Lambda to a Row

We can also operate on multiple columns at once. If we use `apply` without specifying a single column and add the argument `axis=1`, the input to our lambda function will be an entire row, not a column. To access particular values of the row, we use the syntax `row.column_name` or `row['column_name']`.

Suppose we have a table representing a grocery list:

```
Item	Price	Is taxed?
Apple	1.00	No
Milk	4.20	No
Paper Towels	5.00	Yes
Light Bulbs	3.75	Yes
```

If we want to add in the price with tax for each line, we’ll need to look at two columns: `Price` and `Is taxed?`.

If `Is taxed?` is `Yes`, then we’ll want to multiply `Price` by 1.075 (for 7.5% sales tax).

If `Is taxed?` is `No`, we’ll just have `Price` without multiplying it.

We can create this column using a lambda function and the keyword `axis=1`:

```python
df['Price with Tax'] = df.apply(lambda row:
     row['Price'] * 1.075
     if row['Is taxed?'] == 'Yes'
     else row['Price'],
     axis=1
)
```
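Here is the grocery example as a runnable sketch, so the new column can be inspected. The prices are left unrounded.

```python
import pandas as pd

df = pd.DataFrame({
    'Item': ['Apple', 'Milk', 'Paper Towels', 'Light Bulbs'],
    'Price': [1.00, 4.20, 5.00, 3.75],
    'Is taxed?': ['No', 'No', 'Yes', 'Yes'],
})

# With axis=1, the lambda receives one row at a time.
df['Price with Tax'] = df.apply(
    lambda row: row['Price'] * 1.075
    if row['Is taxed?'] == 'Yes'
    else row['Price'],
    axis=1,
)

print(df[['Item', 'Price with Tax']])
```

Bracket access (`row['Is taxed?']`) is needed here because that column name contains a space and a question mark, so `row.column_name` would not work for it.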

## Renaming Columns

When we get our data from other sources, we often want to change the column names. For example, we might want all of the column names to follow variable name rules, so that we can use `df.column_name` (which tab-completes) rather than `df['column_name']` (which takes up extra space).

You can change all of the column names at once by setting the `.columns` property to a different list. This is great when you need to change all of the column names at once, but be careful! You can easily mislabel columns if you get the ordering wrong. Here’s an example:

```python
df = pd.DataFrame({
    'name': ['John', 'Jane', 'Sue', 'Fred'],
    'age': [23, 29, 21, 18]
})
df.columns = ['First Name', 'Age']

```

This command edits the **existing** DataFrame `df`.

## Renaming Columns II

You also can rename individual columns by using the `.rename` method. Pass a dictionary like the one below to the `columns` keyword argument:

```python
{'old_column_name1': 'new_column_name1', 'old_column_name2': 'new_column_name2'}
```

Here’s an example:

```python
df = pd.DataFrame({
    'name': ['John', 'Jane', 'Sue', 'Fred'],
    'age': [23, 29, 21, 18]
})
df.rename(columns={
    'name': 'First Name',
    'age': 'Age'},
    inplace=True)
```

The code above will rename `name` to `First Name` and `age` to `Age`.

Using `rename` with only the `columns` keyword will create a **new** DataFrame, leaving your original DataFrame unchanged. That’s why we also passed in the keyword argument **`inplace=True`**. Using `inplace=True` lets us edit the original DataFrame.

There are several reasons why `.rename` is preferable to `.columns`:

- You can rename just one column
- You can be specific about which column names are getting changed (with `.columns` you can accidentally switch column names if you’re not careful)

**Note:** If you misspell one of the original column names, this command won’t fail. It just won’t change anything.
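The sketch below shows both behaviors: `rename` without `inplace=True` leaves the original untouched, and a misspelled name (`'nmae'`, a deliberate typo for this example) is silently ignored. It also uses the `errors='raise'` argument of `rename`, which is not covered above, to turn that silent case into an error.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Sue', 'Fred'],
    'age': [23, 29, 21, 18]
})

# Without inplace=True, rename returns a new DataFrame.
renamed = df.rename(columns={'name': 'First Name'})
print(list(df.columns))        # the original is unchanged
print(list(renamed.columns))

# A misspelled original name does not fail; nothing is renamed.
typo = df.rename(columns={'nmae': 'First Name'})
print(list(typo.columns))

# errors='raise' asks pandas to complain about unknown names instead.
try:
    df.rename(columns={'nmae': 'First Name'}, errors='raise')
    typo_detected = False
except KeyError:
    typo_detected = True
print(typo_detected)
```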