# Selecting Rows and Columns

Before we go into statistical and inferential methods for cleaning data, we will start with some more basic data cleaning operations. When you import a dataset, you may judge that certain columns and rows are not needed, or you may want to narrow your focus to a specific scope. In this section we will learn different methods for selecting rows and columns using indexing and conditions.

First let's declare a dataframe of some users. 

In [339]:
import pandas as pd 

df = pd.DataFrame({
    "username" : ["thomasnield","samiam", "joecool"],
    "first_name": ["Thomas", 'Sam', 'Joe'], 
    "last_name": ["Nield", 'Scala', 'Morrison'], 
    "email": ["tmnield@outlook.com", 'sam.scala@gmail.com', 'joe@rexonmetals.com']
})

df

,username,first_name,last_name,email
0,thomasnield,Thomas,Nield,tmnield@outlook.com
1,samiam,Sam,Scala,sam.scala@gmail.com
2,joecool,Joe,Morrison,joe@rexonmetals.com


## Understanding iloc and loc 

There are two critical indexers to know in Pandas when you are selecting by index: `loc` and `iloc`. It is very easy to confuse these two: the first works on labels and the second on numeric positions.

Here is where people get confused. Let's say we want to select the first record. We can use both `loc` and `iloc` to do this, and they both produce the same answer.

> Remember that Python and Pandas use 0-based indexing, meaning the first element is at index 0 rather than index 1!

In [343]:
df.iloc[0]

username              thomasnield
first_name                 Thomas
last_name                   Nield
email         tmnield@outlook.com
Name: 0, dtype: object

In [345]:
df.loc[0]

username              thomasnield
first_name                 Thomas
last_name                   Nield
email         tmnield@outlook.com
Name: 0, dtype: object

It seems `loc` and `iloc` do not behave any differently, and this is where people get tripped up. Let's change the index to be the `username`. 

In [348]:
df.set_index('username', inplace=True)

Now try to run `loc` and `iloc` again. Notice that `iloc` works fine, but `loc` no longer does!

In [351]:
df.iloc[0]

first_name                 Thomas
last_name                   Nield
email         tmnield@outlook.com
Name: thomasnield, dtype: object

In [353]:
df.loc[0] # this will cause an error 

KeyError: 0

This is because `iloc` looks up a row by its numeric position, and that is what you should use if that is your intent. `loc` uses the labelled index, which by default was also numeric, but we then changed it to the `username`.

Therefore, if we look up an actual `username` value such as "thomasnield", then `loc` will work.

In [356]:
df.loc['thomasnield']

first_name                 Thomas
last_name                   Nield
email         tmnield@outlook.com
Name: thomasnield, dtype: object
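
The difference also shows up when the index is numeric but does not start at 0. Here is a minimal sketch, assuming a small hypothetical `scores` DataFrame whose integer labels `10`, `20`, `30` were chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: the labels 10, 20, 30 are illustrative, not from the users example
scores = pd.DataFrame(
    {"score": [88, 92, 75]},
    index=[10, 20, 30],
)

by_position = scores.iloc[0]  # first row by numeric position
by_label = scores.loc[10]     # row whose index label is 10

# Both lookups land on the same row
print(by_position["score"], by_label["score"])

# No row carries the label 0, so a label lookup of 0 raises a KeyError
try:
    scores.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(label_zero_found)
```

In this sketch `iloc[0]` and `loc[10]` point at the same record, while `loc[0]` fails because `0` is a position here, not a label.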

## Selecting Ranges 

We can also look up multiple rows at once, whether by numeric position or by label. We can use Python's slice syntax `:` to get a range of numeric positions or labels (if the labels have ordering behavior). For example, I can get the first and second rows.

In [360]:
df.iloc[0:2]

username,first_name,last_name,email
thomasnield,Thomas,Nield,tmnield@outlook.com
samiam,Sam,Scala,sam.scala@gmail.com


If you expected the third row to be included because it has an index of 2 and we selected the range `0:2`, note that the end of the range is exclusive, so the last element is omitted from the selection. Another way to think of it is that we are selecting the elements *between* the two positions. I find this helpful, and here is a visual to demonstrate grabbing the first two elements of a collection.

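[Figure: a collection with slice positions marked between its elements, where `0:2` spans the first two elements]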
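
One related point worth a quick sketch: the exclusive end applies to positional slices with `iloc`, whereas a label slice with `loc` includes the end label. The `users` DataFrame below is a cut-down stand-in for the one in this section.

```python
import pandas as pd

# Cut-down stand-in for the users DataFrame, indexed by username
users = pd.DataFrame(
    {"first_name": ["Thomas", "Sam", "Joe"]},
    index=["thomasnield", "samiam", "joecool"],
)

positional = users.iloc[0:1]                  # end position 1 is excluded: one row
labelled = users.loc["thomasnield":"samiam"]  # end label "samiam" is included: two rows

print(len(positional), len(labelled))
```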

Whenever a range starts at 0, we can omit the 0 and it will be implied.

In [364]:
df.iloc[:2]

username,first_name,last_name,email
thomasnield,Thomas,Nield,tmnield@outlook.com
samiam,Sam,Scala,sam.scala@gmail.com


If we leave off the end value instead, the selection extends to the end of the range. Below we grab everything from the second record onward.

In [367]:
df.iloc[1:]

username,first_name,last_name,email
samiam,Sam,Scala,sam.scala@gmail.com
joecool,Joe,Morrison,joe@rexonmetals.com


If we provide just brackets with a colon inside, it will select all rows. This may seem pointless, but it will serve as a placeholder when we select columns shortly.

In [370]:
df.iloc[:]

username,first_name,last_name,email
thomasnield,Thomas,Nield,tmnield@outlook.com
samiam,Sam,Scala,sam.scala@gmail.com
joecool,Joe,Morrison,joe@rexonmetals.com


Now we can provide a second range to select certain columns, while the first colon still selects all rows. Below we grab all rows and the second through third columns.

In [373]:
df.iloc[:, 1:3]

username,last_name,email
thomasnield,Nield,tmnield@outlook.com
samiam,Scala,sam.scala@gmail.com
joecool,Morrison,joe@rexonmetals.com


### Negative Index

We can also use a negative index to grab rows or columns from the opposite direction, like grabbing the last two columns.

In [377]:
df.iloc[:,-2:]

username,last_name,email
thomasnield,Nield,tmnield@outlook.com
samiam,Scala,sam.scala@gmail.com
joecool,Morrison,joe@rexonmetals.com


Another negative index example: using `-1` to specify grabbing the last row or the last column. 

In [380]:
df.iloc[-1]

first_name                    Joe
last_name                Morrison
email         joe@rexonmetals.com
Name: joecool, dtype: object

In [382]:
df.iloc[:,-1]

username
thomasnield    tmnield@outlook.com
samiam         sam.scala@gmail.com
joecool        joe@rexonmetals.com
Name: email, dtype: object

## Picking Rows and Columns

To get extra picky, we can provide a list of indices instead of a range to pick only certain columns or certain rows. Below we get the second and third rows, and the first and third columns.

In [386]:
df.iloc[1:3, [0,2]]

username,first_name,email
samiam,Sam,sam.scala@gmail.com
joecool,Joe,joe@rexonmetals.com


There is a `loc` equivalent to this cherrypicking as well, where we can provide a list of labels we are interested in. Below I grab the rows with usernames `samiam` and `thomasnield` then extract the `email` column.

In [389]:
df.loc[["samiam","thomasnield"], "email"]

username
samiam         sam.scala@gmail.com
thomasnield    tmnield@outlook.com
Name: email, dtype: object
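
The same label-based picking works for several columns at once. The sketch below rebuilds the users data under a `username` index (an assumption made so the block is self-contained) and passes a list of column labels, then a slice of column labels.

```python
import pandas as pd

# Rebuilt copy of the users data, indexed by username, so the example stands alone
users = pd.DataFrame(
    {
        "first_name": ["Thomas", "Sam", "Joe"],
        "last_name": ["Nield", "Scala", "Morrison"],
        "email": ["tmnield@outlook.com", "sam.scala@gmail.com", "joe@rexonmetals.com"],
    },
    index=pd.Index(["thomasnield", "samiam", "joecool"], name="username"),
)

# A list of row labels plus a list of column labels returns a DataFrame
picked = users.loc[["joecool", "samiam"], ["email", "first_name"]]

# A colon keeps every row; the column label slice includes both end labels
names = users.loc[:, "first_name":"last_name"]

print(picked)
print(names)
```

Rows and columns come back in the order the labels were listed, which makes `loc` handy for reordering as well as selecting.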

## Resetting the Index

You can reset the index back to its default behavior by calling `reset_index()`. Make sure to use the `inplace=True` argument so it replaces the existing DataFrame rather than creating a new one.

In [393]:
df.reset_index(inplace=True)
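
As an alternative to `inplace=True`, you can assign the returned DataFrame to a variable. The sketch below also shows the `drop=True` argument, which discards the old index rather than turning it back into a column. The `users` DataFrame is a cut-down stand-in for the one above.

```python
import pandas as pd

# Cut-down stand-in for the users DataFrame, indexed by username
users = pd.DataFrame(
    {"first_name": ["Thomas", "Sam", "Joe"]},
    index=pd.Index(["thomasnield", "samiam", "joecool"], name="username"),
)

restored = users.reset_index()            # username becomes a regular column again
discarded = users.reset_index(drop=True)  # username values are thrown away

print(list(restored.columns))
print(list(discarded.columns))
```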

> Similar to `loc` and `iloc`, there are also `at` and `iat`. These return a single value at a specific row and column, using labelled or numeric indices respectively.
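
Here is a brief sketch of both accessors on a cut-down stand-in for the users DataFrame: `at` takes a row label and a column label, while `iat` takes a row position and a column position.

```python
import pandas as pd

# Cut-down stand-in for the users DataFrame, indexed by username
users = pd.DataFrame(
    {"first_name": ["Thomas", "Sam"], "last_name": ["Nield", "Scala"]},
    index=["thomasnield", "samiam"],
)

by_label = users.at["samiam", "last_name"]  # row label, column label
by_position = users.iat[1, 1]               # row position, column position

print(by_label, by_position)
```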

## Dropping Rows by Condition

Let's look at how to filter out rows and columns. There are multiple ways to do this. Let's talk about logical operators first. 

Notice how we can extract a column, use the `str` property, and get string-related methods. Let's use `startswith()` and find usernames that begin with the letter "s."

In [398]:
df["username"].str.startswith("s")

0    False
1     True
2    False
Name: username, dtype: bool

The result might not be what you expect. We got a Series of boolean `True/False` values indicating whether each value matches the condition.

You may simply want to list the records that evaluated to `True`. We can achieve that by passing that Series of `True/False` values back into the DataFrame, which then yields only the records where the value is `True`.

In [401]:
condition = df["username"].str.startswith("s")

df[condition]

,username,first_name,last_name,email
1,samiam,Sam,Scala,sam.scala@gmail.com


You can also just embed that logical expression inside the DataFrame getter brackets. 

In [404]:
df[df["username"].str.startswith("s")]

,username,first_name,last_name,email
1,samiam,Sam,Scala,sam.scala@gmail.com


We can also use the `&` and `|` operators to perform *and* and *or* operations respectively with two or more conditions.

In [407]:
df[df["username"].str.startswith("s") & df["email"].str.contains("gmail")]

,username,first_name,last_name,email
1,samiam,Sam,Scala,sam.scala@gmail.com
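
For completeness, here is a sketch of the *or* operator along with `~`, which negates a condition. When conditions are written inline with comparison operators such as `==`, each one needs its own parentheses, because `&` and `|` bind more tightly than comparisons in Python. The `users` DataFrame is a cut-down stand-in for the one above.

```python
import pandas as pd

# Cut-down stand-in for the users DataFrame
users = pd.DataFrame({
    "username": ["thomasnield", "samiam", "joecool"],
    "email": ["tmnield@outlook.com", "sam.scala@gmail.com", "joe@rexonmetals.com"],
})

starts_with_s = users["username"].str.startswith("s")
uses_outlook = users["email"].str.contains("outlook")

either = users[starts_with_s | uses_outlook]      # rows matching at least one condition
neither = users[~(starts_with_s | uses_outlook)]  # rows matching neither condition

# Inline comparisons must be wrapped in parentheses before combining them
inline = users[(users["username"] == "joecool") | (users["email"].str.contains("gmail"))]

print(either)
print(neither)
print(inline)
```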


## Dropping Columns and Rows


There will be times you want to drop rows and columns that are not needed for your task. This is what the `drop()` function is for. 

Below I drop the first and second rows from my DataFrame. Because I want to drop rows, I specify `axis=0`.

In [411]:
df.drop([0,1], axis=0)

,username,first_name,last_name,email
2,joecool,Joe,Morrison,joe@rexonmetals.com


> As always, while not being done here, use `inplace=True` if you want it to replace the existing DataFrame.

Note carefully that this uses the index. Therefore if you have different labels than a typical numeric index, you will need to specify with those labels.

Below I set the index of my DataFrame to use the `username` and then drop those two rows by those usernames. 

In [415]:
df.set_index("username").drop(["thomasnield","samiam"], axis=0)

username,first_name,last_name,email
joecool,Joe,Morrison,joe@rexonmetals.com


You can also use `drop()` to remove columns. Below I specify columns by `axis=1` and drop the `username` and `email` columns from the DataFrame.

In [418]:
df.drop(["username", "email"],axis=1)

,first_name,last_name
0,Thomas,Nield
1,Sam,Scala
2,Joe,Morrison
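
If the `axis` numbers are hard to remember, `drop()` also accepts `index=` and `columns=` keyword arguments. The sketch below shows both, and also drops rows by condition by passing the index labels of the matching rows. The `users` DataFrame is a cut-down stand-in for the one above.

```python
import pandas as pd

# Cut-down stand-in for the users DataFrame
users = pd.DataFrame({
    "username": ["thomasnield", "samiam", "joecool"],
    "first_name": ["Thomas", "Sam", "Joe"],
    "email": ["tmnield@outlook.com", "sam.scala@gmail.com", "joe@rexonmetals.com"],
})

without_rows = users.drop(index=[0, 1])                      # same as drop([0, 1], axis=0)
without_columns = users.drop(columns=["username", "email"])  # same as drop([...], axis=1)

# Drop every row that matches a condition by handing drop() the matching index labels
gmail_rows = users[users["email"].str.contains("gmail")]
without_gmail = users.drop(index=gmail_rows.index)

print(without_rows)
print(without_columns)
print(without_gmail)
```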


If you want to drop columns by a numeric index, you will need to retrieve the column names from the `columns` index object. Below we delete the first and fourth columns in our DataFrame by looking up their corresponding column labels and packaging them into a list.

In [421]:
num_indices = [0,3]
df.drop([df.columns[i] for i in num_indices], axis=1)

,first_name,last_name
0,Thomas,Nield
1,Sam,Scala
2,Joe,Morrison
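
A slightly shorter variation, sketched here on a rebuilt copy of the users data, indexes the `columns` object with the whole list of positions at once. Selecting the columns you want to keep with `iloc` gives the same result from the opposite direction.

```python
import pandas as pd

# Rebuilt copy of the users data so the example stands alone
users = pd.DataFrame({
    "username": ["thomasnield", "samiam", "joecool"],
    "first_name": ["Thomas", "Sam", "Joe"],
    "last_name": ["Nield", "Scala", "Morrison"],
    "email": ["tmnield@outlook.com", "sam.scala@gmail.com", "joe@rexonmetals.com"],
})

positions = [0, 3]
labels = users.columns[positions]     # labels at positions 0 and 3: username, email
trimmed = users.drop(columns=labels)

# Keeping the other positions with iloc yields the same columns
kept = users.iloc[:, [1, 2]]

print(trimmed)
```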


## Appending Rows and Columns 

### Appending Columns

Appending a column to a DataFrame can be done in several ways. The simplest is to define the new column label inside the square brackets like `df["phone"]` and then assign a simple list, a dictionary, a Series, or another DataFrame. 

Below we create a new `phone` column and apply the data using a simple list. The number of values must match the number of records. 

In [425]:
df["phone"] = ["213-247-5724","754-238-8237","233-555-2311"]

df


,username,first_name,last_name,email,phone
0,thomasnield,Thomas,Nield,tmnield@outlook.com,213-247-5724
1,samiam,Sam,Scala,sam.scala@gmail.com,754-238-8237
2,joecool,Joe,Morrison,joe@rexonmetals.com,233-555-2311
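
Assigning a Series behaves differently from assigning a list: the values are matched to rows by index label rather than by order, and rows without a matching label receive a missing value. A small sketch, using a cut-down stand-in DataFrame and a hypothetical `phones` Series:

```python
import pandas as pd

# Cut-down stand-in for the users DataFrame
users = pd.DataFrame({"username": ["thomasnield", "samiam", "joecool"]})

# Hypothetical Series covering only index labels 2 and 0, listed out of order
phones = pd.Series({2: "233-555-2311", 0: "213-247-5724"})

users["phone"] = phones  # aligned on index labels, not on position

print(users)
```

Row `1` ends up with a missing phone value because the Series has no entry for that label.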


If you want to add a column at a specific location, you can use the `insert()` function. Provide first the positional index and the column name, and then a list of values. 

Below we add a `twitter` column in the fourth column position of the DataFrame.

In [428]:
df.insert(3, "twitter", ["@thomasnield76","@samiam46","@joe564"])

df

,username,first_name,last_name,twitter,email,phone
0,thomasnield,Thomas,Nield,@thomasnield76,tmnield@outlook.com,213-247-5724
1,samiam,Sam,Scala,@samiam46,sam.scala@gmail.com,754-238-8237
2,joecool,Joe,Morrison,@joe564,joe@rexonmetals.com,233-555-2311


### Appending Rows

Adding rows can be done in a similar way to adding columns, but it uses the `loc` property. Provide the index label for the new record and assign a list of values to it.

In [432]:
df.loc[3] = ["jasonmarley", "Jason", "Marley", "@OtherJason45", "jason@marlyco.net", "214-282-9998"]
df

,username,first_name,last_name,twitter,email,phone
0,thomasnield,Thomas,Nield,@thomasnield76,tmnield@outlook.com,213-247-5724
1,samiam,Sam,Scala,@samiam46,sam.scala@gmail.com,754-238-8237
2,joecool,Joe,Morrison,@joe564,joe@rexonmetals.com,233-555-2311
3,jasonmarley,Jason,Marley,@OtherJason45,jason@marlyco.net,214-282-9998


You can also use the `concat()` function to combine two or more DataFrames into one.

In [435]:
new_record = pd.DataFrame({
    "username":["wyatt_urp"],
    "first_name":["Wyatt"],
    "last_name":["Jones"], 
    "email":["wyatt.jones@jonesco.com"],
    "twitter": [None],
    "phone":["444-244-7642"]
})
df = pd.concat([df, new_record])

df

,username,first_name,last_name,twitter,email,phone
0,thomasnield,Thomas,Nield,@thomasnield76,tmnield@outlook.com,213-247-5724
1,samiam,Sam,Scala,@samiam46,sam.scala@gmail.com,754-238-8237
2,joecool,Joe,Morrison,@joe564,joe@rexonmetals.com,233-555-2311
3,jasonmarley,Jason,Marley,@OtherJason45,jason@marlyco.net,214-282-9998
0,wyatt_urp,Wyatt,Jones,,wyatt.jones@jonesco.com,444-244-7642
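
Notice that the new record kept its own index label of `0`, so that label now appears twice. If that is not what you want, passing `ignore_index=True` to `concat()` renumbers the result. The sketch below uses two small stand-in DataFrames, `first` and `second`, to show the difference.

```python
import pandas as pd

# Small stand-in DataFrames, each with its own default index starting at 0
first = pd.DataFrame({"username": ["thomasnield", "samiam"]})
second = pd.DataFrame({"username": ["wyatt_urp"]})

kept = pd.concat([first, second])                           # index labels: 0, 1, 0
renumbered = pd.concat([first, second], ignore_index=True)  # index labels: 0, 1, 2

# With a duplicated label, loc returns every row carrying that label
duplicates = kept.loc[0]

print(list(kept.index), list(renumbered.index))
print(duplicates)
```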


## Updating Data 

### Updating a Column

You can update an entire column in Pandas by using the `=` operator. Below we update all the `email` values to be uppercase. 

In [440]:
df["email"] = df["email"].str.upper()

df

,username,first_name,last_name,twitter,email,phone
0,thomasnield,Thomas,Nield,@thomasnield76,TMNIELD@OUTLOOK.COM,213-247-5724
1,samiam,Sam,Scala,@samiam46,SAM.SCALA@GMAIL.COM,754-238-8237
2,joecool,Joe,Morrison,@joe564,JOE@REXONMETALS.COM,233-555-2311
3,jasonmarley,Jason,Marley,@OtherJason45,JASON@MARLYCO.NET,214-282-9998
0,wyatt_urp,Wyatt,Jones,,WYATT.JONES@JONESCO.COM,444-244-7642


### Updating On a Condition 

We can also conditionally update one or more specific records by passing a logical condition to the `loc` function as well as the desired column to be updated. 

Below we change Joe's email address.

In [444]:
condition = df["first_name"].eq("Joe")

df.loc[condition, "email"] = "joe@gmail.com"

df

,username,first_name,last_name,twitter,email,phone
0,thomasnield,Thomas,Nield,@thomasnield76,TMNIELD@OUTLOOK.COM,213-247-5724
1,samiam,Sam,Scala,@samiam46,SAM.SCALA@GMAIL.COM,754-238-8237
2,joecool,Joe,Morrison,@joe564,joe@gmail.com,233-555-2311
3,jasonmarley,Jason,Marley,@OtherJason45,JASON@MARLYCO.NET,214-282-9998
0,wyatt_urp,Wyatt,Jones,,WYATT.JONES@JONESCO.COM,444-244-7642
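
The assigned value does not have to be a constant. As a sketch, the same `loc` pattern can apply a transformation to only the matching rows. Here a cut-down stand-in DataFrame has its Gmail addresses lowercased while the rest are left alone.

```python
import pandas as pd

# Cut-down stand-in for the users DataFrame after the uppercase update
users = pd.DataFrame({
    "first_name": ["Thomas", "Sam", "Joe"],
    "email": ["TMNIELD@OUTLOOK.COM", "SAM.SCALA@GMAIL.COM", "JOE@REXONMETALS.COM"],
})

is_gmail = users["email"].str.endswith("GMAIL.COM")

# Read the matching rows, transform them, and write them back to the same rows
users.loc[is_gmail, "email"] = users.loc[is_gmail, "email"].str.lower()

print(users)
```

Going through `loc` with both the condition and the column in one step is the dependable form. Chaining two sets of brackets, as in `users[is_gmail]["email"] = ...`, may end up modifying a temporary copy instead of the original DataFrame.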


### Updating a Row

You can also use an index for a row, and assign a new record for that row. 

In [448]:
df.loc[1] = ['samiam2022','Samuel','Scala',None,'sam.scala@gmail.com',None]

df

,username,first_name,last_name,twitter,email,phone
0,thomasnield,Thomas,Nield,@thomasnield76,TMNIELD@OUTLOOK.COM,213-247-5724
1,samiam2022,Samuel,Scala,,sam.scala@gmail.com,
2,joecool,Joe,Morrison,@joe564,joe@gmail.com,233-555-2311
3,jasonmarley,Jason,Marley,@OtherJason45,JASON@MARLYCO.NET,214-282-9998
0,wyatt_urp,Wyatt,Jones,,WYATT.JONES@JONESCO.COM,444-244-7642


## Melting

There will be situations in data cleaning where you will want to unpivot columns to become row values. For example, take this simple but possibly inconvenient DataFrame that breaks up a revenue amount by category, but also by the years `2022` and `2021` as two separate columns. 

In [452]:
df = pd.DataFrame({
    "CATEGORY" : ["ALPHA", "BETA", "GAMMA"],
    "2022" : [120, 250, 280],
    "2021" : [320, 420, 170]
})

df

,CATEGORY,2022,2021
0,ALPHA,120,320
1,BETA,250,420
2,GAMMA,280,170


This is not exactly normalized data, where we would expect a `YEAR` column and then `2022` and `2021` would be values in that column. 

Thankfully, the `melt()` function in Pandas will do this transformation. Let's take a look and see if you can dissect how this function is used, and what its arguments mean.

In [455]:
df.melt(id_vars=['CATEGORY'], value_vars=['2022','2021'], var_name="YEAR", value_name="AMOUNT")

,CATEGORY,YEAR,AMOUNT
0,ALPHA,2022,120
1,BETA,2022,250
2,GAMMA,2022,280
3,ALPHA,2021,320
4,BETA,2021,420
5,GAMMA,2021,170


As you can see above, `id_vars` names the columns that stay unchanged and "anchor" the records. The `value_vars` are the column names that will be flipped into values of a new column, which we name `YEAR` through the `var_name` argument. The `value_name` argument names the column that all the values will be moved to.
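
Two small variations are sketched below on a two-row stand-in called `revenue`. If `value_vars` is left out, `melt()` unpivots every column not listed in `id_vars`. And because the years started life as column names, they arrive in the `YEAR` column as strings, so converting them to integers is an optional extra step.

```python
import pandas as pd

# Two-row stand-in for the revenue DataFrame
revenue = pd.DataFrame({
    "CATEGORY": ["ALPHA", "BETA"],
    "2022": [120, 250],
    "2021": [320, 420],
})

# Without value_vars, every column other than CATEGORY is unpivoted
long_form = revenue.melt(id_vars=["CATEGORY"], var_name="YEAR", value_name="AMOUNT")

# The year labels were column names (strings), so convert them to integers
long_form["YEAR"] = long_form["YEAR"].astype(int)

print(long_form)
```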

## Exercise 

Complete the code below so the `employee_id` is the index for the dataframe. Make the `department` field uppercase, and then add a column `birth_date` with values `1985-04-02`, `1987-05-16`, and `1979-01-03`. Then delete the record where the `employee_id` is `c275213`. 

In [460]:
import pandas as pd 

df = pd.DataFrame({
    'employee_id' : ['e947521','e875624','c275213'],
    'hire_date' : ['2019-12-01','2018-05-21','2018-11-03'],
    'department' : ['marketing', 'research','accounting']
})

# set the index 
df.set_index('employee_id', inplace=True)

# make 'department' column uppercase
df['department'] = df['department'].str.upper()

# add birth_date column
df['birth_date'] = ['1985-04-02', '1987-05-16', '1979-01-03']

# drop record with 'employee_id' of c275213
df.drop(["c275213"], inplace=True)

# display the DataFrame
df

employee_id,hire_date,department,birth_date
e947521,2019-12-01,MARKETING,1985-04-02
e875624,2018-05-21,RESEARCH,1987-05-16
