# DataFrame Structure Methods

In this chapter, we cover several different methods that change the structure of the DataFrame. We will be adding and dropping rows and columns from our DataFrame and renaming the labels for both the rows and columns.

## Adding a new column to the DataFrame

A new column may be added to a DataFrame using syntax similar to selecting a single column with the brackets. This is done without the use of a method. The general syntax looks like the following:

```
>>> df['new_column'] = <some expression>
```

Let's begin by reading in the college dataset with the institution name set as the index. We'll use a small subset of this DataFrame consisting of the first three rows and six of the columns.

In [None]:
import pandas as pd
college = pd.read_csv('../data/college.csv', index_col='instnm')
cols = ['city', 'stabbr', 'relaffil', 'satvrmid', 'satmtmid', 'ugds']
cs = college[cols].head(3)
cs

### Reading in a subset of columns with the `usecols` parameter

The above set of commands is suboptimal because it reads every column of the college dataset into memory before discarding most of them. Instead, we can choose a subset of columns to read with the `usecols` parameter of the `read_csv` function by passing it a list of the columns we want. The list must include the column used as the index. We can also use the `nrows` parameter to read in only the first `n` rows.

In [None]:
cols = ['instnm', 'city', 'stabbr', 'relaffil', 'satvrmid', 'satmtmid', 'ugds']
cs = pd.read_csv('../data/college.csv', index_col='instnm', usecols=cols, nrows=3)
cs
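
If you don't have the college dataset at hand, the same two parameters can be tried on a tiny in-memory CSV. This is only a sketch: the CSV text, its column names, and the variable `small` are made up for illustration and are not part of the college data.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """name,city,state,enrollment
Alpha College,Austin,TX,1200
Beta University,Boston,MA,5400
Gamma Institute,Denver,CO,800
"""

# usecols lists the columns to keep, including the one used as the index;
# nrows limits how many data rows are read
small = pd.read_csv(io.StringIO(csv_text), index_col='name',
                    usecols=['name', 'state', 'enrollment'], nrows=2)
print(small)
```

The `city` column and the third row are never loaded into the DataFrame.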

Let's add the two SAT columns together and assign the result as a new column. The new column will always be appended to the end.

In [None]:
cs['sat_total'] = cs['satmtmid'] + cs['satvrmid']
cs
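
The same pattern can be sketched on a small hand-built DataFrame. The data and the names `scores`, `math`, `verbal`, and `total` are hypothetical and chosen only to keep the example self-contained.

```python
import pandas as pd

# Hypothetical data; the labels are illustrative only
scores = pd.DataFrame({'math': [500, 620, 710],
                       'verbal': [480, 650, 690]},
                      index=['School A', 'School B', 'School C'])

# Add the two columns element-wise and assign the result to a new label
scores['total'] = scores['math'] + scores['verbal']
print(scores)
```

Because `total` did not exist beforehand, it lands as the last column.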

### Setting a column equal to a scalar value

You can create a new column by assigning it to be a single scalar value. For instance, the following assignment creates a new column of values equal to the number -99.

In [None]:
cs['some_num'] = -99
cs

### Overwriting an existing column

You can replace the contents of an existing column by assigning it to some other value. Below, we increase the undergraduate population of each college by 10%.

In [None]:
cs['ugds'] = cs['ugds'] * 1.1
cs
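
One detail worth checking for yourself is where columns end up. In the sketch below (the DataFrame `df_demo` and its column names are invented for this example), a scalar assignment creates a new last column, while overwriting an existing column leaves it in its original position.

```python
import pandas as pd

# Hypothetical two-column DataFrame
df_demo = pd.DataFrame({'a': [1, 2, 3], 'b': [10.0, 20.0, 30.0]})

# A scalar is repeated for every row of the new column
df_demo['flag'] = -99

# Overwriting an existing column keeps its place in the column order
df_demo['a'] = df_demo['a'] * 2
print(df_demo)
```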

### Create a new column from a numpy array

You can create a new column by assigning it to a numpy array (or another Python sequence) that is the same length as the DataFrame. Below, we create a column of random normal variables.

In [None]:
import numpy as np
cs['random_normal'] = np.random.randn(len(cs))
cs
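
The length requirement is enforced. As a small illustration (the DataFrame `demo` and its column names are hypothetical), assigning an array with the wrong number of values raises a `ValueError` and the column is not created.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row DataFrame
demo = pd.DataFrame({'x': [1, 2, 3]})

# Works: the array has one value per row
demo['doubled'] = np.array([2, 4, 6])

# Fails: two values cannot fill three rows
try:
    demo['too_short'] = np.array([1, 2])
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```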

## Copying a DataFrame

The `copy` method is available to make a completely new copy of a DataFrame that is not associated with the original. This is necessary because assigning a DataFrame to a new variable does not copy it. Let's read in a sample DataFrame and assign it to the variable name `df`.

In [None]:
df = pd.read_csv('../data/sample_data.csv', index_col=0)
df

We can create a new variable name by assigning it to `df`. This does not make a new copy of the data.

In [None]:
df1 = df
df1

If you are unfamiliar with Python, you might mistakenly assume that `df` and `df1` reference different DataFrames. What we have is a single DataFrame object that is referenced by two different variable names. We can prove this with the `is` operator, which returns `True` when both names refer to the same object.

In [None]:
df is df1

We can also verify this by adding a new column to `df`.

In [None]:
df['new_col'] = 5
df

Let's now output `df1` to show that it too has changed.

In [None]:
df1

The variables `df` and `df1` are just two different names that reference the same underlying DataFrame. If you'd like to create a completely new DataFrame with the same data, you need to use the `copy` method. Let's read in the same dataset again, but this time assign `df1` to a copy of `df`.

In [None]:
df = pd.read_csv('../data/sample_data.csv', index_col=0)
df1 = df.copy()

Testing whether `df` and `df1` reference the same DataFrame now yields `False`.

In [None]:
df is df1

If we add a column to `df` it will have no effect on `df1`.

In [None]:
df['new_col'] = 5
df

Outputting `df1` shows that it is unchanged.

In [None]:
df1
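
The whole comparison fits in a few lines when built on a hand-made DataFrame. The names `original`, `alias`, and `duplicate` below are hypothetical and used only for this sketch.

```python
import pandas as pd

# Hypothetical DataFrame
original = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

alias = original             # a second name for the same object
duplicate = original.copy()  # a separate object holding the same data

# Adding a column through one name is visible through the alias only
original['c'] = 0

print(alias is original)      # same object
print(duplicate is original)  # different object
print(list(duplicate.columns))
```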

## Column and Row Dropping and Renaming

pandas provides the methods `drop` and `rename` to drop and rename columns and rows.

### Dropping Columns

The `drop` method drops columns passed to the `columns` parameter as either a string or a list of strings. Let's see examples of dropping a single column and then multiple columns. Remember that `drop`, like most DataFrame methods, returns a completely new object, so the original DataFrame is not affected. You'll need to assign the result to a variable name if you'd like to proceed with the slimmer DataFrame.

In [None]:
cs.drop(columns='city').head(2)

Use a list to drop multiple columns.

In [None]:
cols = ['city', 'stabbr', 'satvrmid']
cs.drop(columns=cols).head(2)

You can also drop rows by **label** and not integer location with the `drop` method using a single label or a list of labels.

In [None]:
rows = ['Alabama A & M University', 'University of Alabama at Birmingham']
cs.drop(index=rows).head(3)
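
To see that `drop` leaves the calling DataFrame alone, here is a sketch on invented data. The DataFrame `schools` and all of its labels are hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame with string row labels
schools = pd.DataFrame({'city': ['Austin', 'Boston', 'Denver'],
                        'state': ['TX', 'MA', 'CO'],
                        'size': [1200, 5400, 800]},
                       index=['Alpha', 'Beta', 'Gamma'])

# Each call returns a new DataFrame; the result must be assigned to keep it
slim = schools.drop(columns=['city', 'state'])
fewer_rows = schools.drop(index='Beta')

# Rows and columns can be dropped in the same call
both = schools.drop(index=['Alpha'], columns='size')

print(schools.shape)  # the original still has every row and column
```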

### Renaming Columns

The `rename` method is used to rename columns. Pass a dictionary to the `columns` parameter with keys equal to the old column name and values equal to the new column name. The college dataset has lots of columns with abbreviations that are not immediately recognized. Below, we replace a couple of these columns with more explicit names.

In [None]:
cs.rename(columns={'stabbr': 'state_abbreviation',
                        'relaffil': 'religious_affiliation'}).head(2)
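
The `rename` method also accepts an `index` parameter that works the same way for row labels. The following sketch uses a made-up DataFrame; `abbrevs` and its labels are hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame with abbreviated labels
abbrevs = pd.DataFrame({'st': ['TX', 'MA'], 'pop': [1200, 5400]},
                       index=['Alpha College', 'Beta University'])

# Keys are old labels and values are new labels;
# labels that are not mentioned are left unchanged
renamed = abbrevs.rename(columns={'st': 'state'},
                         index={'Alpha College': 'AC'})
print(renamed)
```

As with `drop`, the result is a new DataFrame and `abbrevs` keeps its original labels.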

### Renaming all columns at once

Instead of using the `rename` method to rename individual columns, you can assign the `columns` attribute a list of the new column names. The length of the list must be the same as the number of columns. Let's first save the original column names to their own variable name so that we can use them in the future.

In [None]:
orig_cols = cs.columns
orig_cols

Let's overwrite all of the old columns by assigning them to a list of new column names.

In [None]:
cs.columns = ['CITY', 'STATE', 'RELAFFIL', 'SATVERBAL', 'SATMATH', 'UGDS', 'SAT_TOTAL', 'SOME_NUM', 'RN']
cs

Let's overwrite these column names again so that they are back to the original names.

In [None]:
cs.columns = orig_cols
cs
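
The length requirement can be checked with a short sketch. The DataFrame `t` and its labels are invented for this example. Assigning a list of the wrong length raises a `ValueError` and the existing names stay in place.

```python
import pandas as pd

# Hypothetical three-column DataFrame
t = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
saved = t.columns  # keep the original names

# Works: three names for three columns
t.columns = ['A', 'B', 'C']

# Fails: two names for three columns
try:
    t.columns = ['only', 'two']
    mismatch_error = None
except ValueError as err:
    mismatch_error = str(err)

names_after_failure = list(t.columns)
t.columns = saved  # restore the original names
print(mismatch_error)
```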

## Inserting columns in the middle of a DataFrame

We previously learned about adding a new column to a DataFrame using just the brackets. New columns are always appended to the end of the DataFrame. You can instead use the `insert` method to place the new column in a specific location other than the end. This method has the following three required parameters:

* `loc` - the integer location of the new column
* `column` - the name of the new column
* `value` - the values of the new column

This method works **in-place** and is one of the few that do so by default. This means that the calling DataFrame gets modified and nothing is returned. There is no assignment statement when using `insert`. Let's insert the same SAT total right after the `satmtmid` column. We will call it `sat_total_insert` to differentiate it from the column on the end.

In [None]:
new_vals = cs['satvrmid'] + cs['satmtmid']
cs.insert(5, 'sat_total_insert', new_vals)
cs

One minor annoyance is that you must know the integer location of where you'd like to insert the new column. In the above example, it's easy enough to just count, but a more automated solution would be nice. The pandas Index object has a method called `get_loc` which returns the integer location of a column name.

This is a rare instance in this book where an Index method is used. I advise not digging into Index objects unless there is some very specialized need. So, with some hesitation, I present the `get_loc` Index method here. First, access the `columns` attribute (which is an Index object) and pass the `get_loc` method the name of the column.

In [None]:
cs.columns.get_loc('satmtmid')

Make note that the `get_loc` method does not exist for Series or DataFrame objects. It is strictly an Index method available to either the index or the columns.
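
Combining the two methods removes the need to count columns by hand. The sketch below uses a made-up DataFrame (`grades` and its column names are hypothetical) and assumes the column labels are unique, so that `get_loc` returns a single integer.

```python
import pandas as pd

# Hypothetical DataFrame with unique column labels
grades = pd.DataFrame({'name': ['Ann', 'Bo'], 'math': [90, 70],
                       'verbal': [80, 85], 'year': [1, 2]})

# Location of the existing column, plus one to land just after it
loc = grades.columns.get_loc('verbal') + 1

# insert modifies grades in place and returns nothing
grades.insert(loc, 'total', grades['math'] + grades['verbal'])
print(grades)
```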

### Comparison to Python lists

The DataFrame `insert` method is analogous to a Python list method with the same name. It too inserts a value into the list in-place given an integer location. Let's complete an example to compare how it works.

In [None]:
a = ['some', 'list', 'of', 'strings']
a

Call the list `insert` method which mutates the list in-place.

In [None]:
a.insert(1, 'short')
a

There's also an `index` method that returns the integer location of a particular item in the list, which is analogous to the `get_loc` method.

In [None]:
a.index('of')

## The `pop` method

The DataFrame `pop` method removes a single column from a DataFrame and returns it as a Series. This is different from the `drop` method, which removes a column or columns and returns a new DataFrame of the remaining columns. The `pop` method modifies the calling DataFrame in-place. Below, we remove the `ugds` column and assign it to a variable with the same name.

In [None]:
ugds = cs.pop('ugds')
ugds

The `cs` DataFrame no longer contains the `ugds` column.

In [None]:
cs
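
The contrast with `drop` can be sketched on a small invented DataFrame. The name `inventory` and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame
inventory = pd.DataFrame({'item': ['pen', 'ink'],
                          'price': [1.5, 4.0],
                          'qty': [10, 3]})

# pop removes the column from inventory and hands it back as a Series
price = inventory.pop('price')

print(type(price))
print(list(inventory.columns))
```

Unlike `drop`, `pop` accepts only a single column label and needs no reassignment of the DataFrame.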

## Exercises

Run the cell below to create a variable name `college_all` that contains all of the rows of the college dataset along with six of the columns. We use the `

In [None]:
cols = ['instnm', 'city', 'stabbr', 'relaffil', 'satvrmid', 'satmtmid', 'ugds']
college_all = pd.read_csv('../data/college.csv', index_col='instnm', usecols=cols)
college_all.head()

In [None]:
college_all.shape

### Exercise 1

<span  style="color:green; font-size:16px">Create a new boolean column in the `college_all` DataFrame named 'Verbal Higher' that is True for every college that has a higher verbal than math SAT score. Find the mean of this new column. Why does this number look suspiciously low?</span>

### Exercise 2

<span  style="color:green; font-size:16px">Find the real percentage of schools with higher verbal than math SAT scores.</span>

### Exercise 3

<span  style="color:green; font-size:16px">Create a new column called 'median all' that has every value set to the median population of all the schools.</span>

### Exercise 4

<span  style="color:green; font-size:16px">Rename the row label 'Texas A &amp; M University-College Station' to 'TAMU'. Reassign the result back to `college_all` and then select this row as a Series.</span>

### Exercise 5

<span  style="color:green; font-size:16px">Create a new column `bonus` right after the salary column equal to 10% of the salary. Round the bonus to the nearest thousand.</span>

### Exercise 6

<span  style="color:green; font-size:16px">Read in the college dataset and set `instnm` as the index and assign it to the variable name `college1`. Use the `copy` method to create a new copy of the `college1` DataFrame and assign it to variable `college2`. Select all the non-white race columns (`ugds_black` through `ugds_unkn`). Sum the rows of this DataFrame and assign the result to a variable. Now drop all the non-white race columns from the `college2` DataFrame and assign the result to `college3`.</span>
    
<span  style="color:green; font-size:16px">Use the `insert` method to insert a new column to the right of the `ugds_white` column of the `college3` DataFrame. Name this column `ugds_nonwhite`.</span>