# Data Manipulation with pandas

In [1]:
# Setup header
import pandas as pd

homelessness = pd.read_csv('homelessness.csv', index_col=0)

## Transforming `DataFrame`s

### Inspecting a `DataFrame`

There are several ways to inspect a pandas `DataFrame`. `.head()` returns the first few rows (five by default), giving a quick look at the data without summarizing it:

In [2]:
homelessness.head()

Unnamed: 0,region,state,individuals,family_members,state_pop
0,East South Central,Alabama,2570.0,864.0,4887681
1,Pacific,Alaska,1434.0,582.0,735139
2,Mountain,Arizona,7259.0,2606.0,7158024
3,West South Central,Arkansas,2280.0,432.0,3009733
4,Pacific,California,109008.0,20964.0,39461588


`.info()` shows the dtype of each column along with its non-null count, which indicates how many values are missing:

In [3]:
homelessness.info()

<class 'pandas.core.frame.DataFrame'>
Index: 51 entries, 0 to 50
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   region          51 non-null     object 
 1   state           51 non-null     object 
 2   individuals     51 non-null     float64
 3   family_members  51 non-null     float64
 4   state_pop       51 non-null     int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 2.4+ KB


`.shape` is a tuple with the number of rows and columns in the `DataFrame`. Note that `.shape` is an attribute rather than a method:

In [4]:
homelessness.shape

(51, 5)
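As a rough illustration of the attribute-versus-method distinction, the sketch below rebuilds the five rows displayed by `.head()` as a small stand-in `DataFrame`, so it runs without the CSV file. The name `sample` is hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in: the five rows displayed by .head() above
sample = pd.DataFrame({
    'region': ['East South Central', 'Pacific', 'Mountain',
               'West South Central', 'Pacific'],
    'state': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California'],
    'individuals': [2570.0, 1434.0, 7259.0, 2280.0, 109008.0],
    'family_members': [864.0, 582.0, 2606.0, 432.0, 20964.0],
    'state_pop': [4887681, 735139, 7158024, 3009733, 39461588],
})

# .shape is an attribute, so it is read without parentheses
n_rows, n_cols = sample.shape
print(n_rows, n_cols)  # 5 5

# .head() is a method and accepts the number of rows to return
first_two = sample.head(2)
print(first_two.shape)  # (2, 5)
```

Unpacking `.shape` into two names is a common way to get the row and column counts separately.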

Lastly, `.describe()` gives summary statistics of the *numerical* columns:

In [5]:
homelessness.describe()

Unnamed: 0,individuals,family_members,state_pop
count,51.0,51.0,51.0
mean,7225.784314,3504.882353,6405637.0
std,15991.025083,7805.411811,7327258.0
min,434.0,75.0,577601.0
25%,1446.5,592.0,1777414.0
50%,3082.0,1482.0,4461153.0
75%,6781.5,3196.0,7340946.0
max,109008.0,52070.0,39461590.0
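The sketch below shows, on a small stand-in built from the rows displayed earlier (the name `sample` is hypothetical), that `.describe()` skips text columns when numerical ones are present, and that calling it on text columns alone yields a different set of statistics.

```python
import pandas as pd

# Hypothetical stand-in built from rows shown earlier in this section
sample = pd.DataFrame({
    'region': ['East South Central', 'Pacific', 'Mountain',
               'West South Central', 'Pacific'],
    'state': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California'],
    'individuals': [2570.0, 1434.0, 7259.0, 2280.0, 109008.0],
    'state_pop': [4887681, 735139, 7158024, 3009733, 39461588],
})

# With mixed column types, only the numerical columns are summarized
numeric_summary = sample.describe()
print(list(numeric_summary.columns))  # ['individuals', 'state_pop']

# On text columns alone, the summary reports count, unique, top and freq
text_summary = sample[['region', 'state']].describe()
print(list(text_summary.index))  # ['count', 'unique', 'top', 'freq']
```

Individual statistics can be pulled out with `.loc`, for example `numeric_summary.loc['max', 'individuals']`.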


### Parts of a `DataFrame`

`.values` returns the contents of the `DataFrame` as a "raw" NumPy array (the pandas documentation recommends the equivalent `.to_numpy()` method for new code):

In [6]:
homelessness.values

array([['East South Central', 'Alabama', 2570.0, 864.0, 4887681],
       ['Pacific', 'Alaska', 1434.0, 582.0, 735139],
       ['Mountain', 'Arizona', 7259.0, 2606.0, 7158024],
       ['West South Central', 'Arkansas', 2280.0, 432.0, 3009733],
       ['Pacific', 'California', 109008.0, 20964.0, 39461588],
       ['Mountain', 'Colorado', 7607.0, 3250.0, 5691287],
       ['New England', 'Connecticut', 2280.0, 1696.0, 3571520],
       ['South Atlantic', 'Delaware', 708.0, 374.0, 965479],
       ['South Atlantic', 'District of Columbia', 3770.0, 3134.0, 701547],
       ['South Atlantic', 'Florida', 21443.0, 9587.0, 21244317],
       ['South Atlantic', 'Georgia', 6943.0, 2556.0, 10511131],
       ['Pacific', 'Hawaii', 4131.0, 2399.0, 1420593],
       ['Mountain', 'Idaho', 1297.0, 715.0, 1750536],
       ['East North Central', 'Illinois', 6752.0, 3891.0, 12723071],
       ['East North Central', 'Indiana', 3776.0, 1482.0, 6695497],
       ['West North Central', 'Iowa', 1711.0, 1038.0, 3148618]

`.columns` gives a pandas `Index` object for the columns of the `DataFrame`:

In [7]:
homelessness.columns

Index(['region', 'state', 'individuals', 'family_members', 'state_pop'], dtype='object')

`.index` gives the index for the rows, which consists of either row numbers or row names:

In [8]:
homelessness.index

Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
       36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50],
      dtype='int64')

pandas `Index` objects are a subtle topic that will receive a lot more treatment later.
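To make the three parts concrete, here is a minimal sketch on a three-row stand-in (the names `sample` and `by_state` are hypothetical). It uses `.to_numpy()` in place of `.values`, and `set_index()` to show an index made of row names instead of row numbers.

```python
import pandas as pd

# Hypothetical stand-in built from the first three rows shown earlier
sample = pd.DataFrame({
    'state': ['Alabama', 'Alaska', 'Arizona'],
    'individuals': [2570.0, 1434.0, 7259.0],
    'state_pop': [4887681, 735139, 7158024],
})

# The data itself, as a two-dimensional NumPy array
arr = sample.to_numpy()
print(arr.shape)  # (3, 3)

# Column labels and row labels are both Index objects
print(list(sample.columns))  # ['state', 'individuals', 'state_pop']
print(list(sample.index))    # [0, 1, 2]

# Using a column as the index replaces row numbers with row names
by_state = sample.set_index('state')
print(list(by_state.index))  # ['Alabama', 'Alaska', 'Arizona']
```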

### Sorting rows

`.sort_values()` called with a single column name sorts the `DataFrame` by that column in ascending order. For example:

In [9]:
homelessness.sort_values('individuals')

Unnamed: 0,region,state,individuals,family_members,state_pop
50,Mountain,Wyoming,434.0,205.0,577601
34,West North Central,North Dakota,467.0,75.0,758080
7,South Atlantic,Delaware,708.0,374.0,965479
39,New England,Rhode Island,747.0,354.0,1058287
45,New England,Vermont,780.0,511.0,624358
...,...,...,...,...,...
47,Pacific,Washington,16424.0,5880.0,7523869
43,West South Central,Texas,19199.0,6111.0,28628666
9,South Atlantic,Florida,21443.0,9587.0,21244317
32,Mid-Atlantic,New York,39827.0,52070.0,19530351
4,Pacific,California,109008.0,20964.0,39461588


Descending order can be achieved by passing `ascending=False`:

In [10]:
homelessness.sort_values('family_members', ascending=False)

Unnamed: 0,region,state,individuals,family_members,state_pop
32,Mid-Atlantic,New York,39827.0,52070.0,19530351
4,Pacific,California,109008.0,20964.0,39461588
21,New England,Massachusetts,6811.0,13257.0,6882635
9,South Atlantic,Florida,21443.0,9587.0,21244317
43,West South Central,Texas,19199.0,6111.0,28628666
...,...,...,...,...,...
24,East South Central,Mississippi,1024.0,328.0,2981020
41,West North Central,South Dakota,836.0,323.0,878698
48,South Atlantic,West Virginia,1021.0,222.0,1804291
50,Mountain,Wyoming,434.0,205.0,577601
34,West North Central,North Dakota,467.0,75.0,758080
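`.sort_values()` also accepts a list of column names, with a matching list for `ascending`. The sketch below is one possible extension of the examples above, run on a five-row stand-in (the name `sample` is hypothetical): it sorts by region alphabetically and, within each region, by family members from largest to smallest.

```python
import pandas as pd

# Hypothetical stand-in built from rows shown earlier in this section
sample = pd.DataFrame({
    'region': ['East South Central', 'Pacific', 'Mountain',
               'West South Central', 'Pacific'],
    'state': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California'],
    'family_members': [864.0, 582.0, 2606.0, 432.0, 20964.0],
})

# One ascending flag per sort column: region A-Z, family_members high-to-low
ordered = sample.sort_values(['region', 'family_members'],
                             ascending=[True, False])
print(list(ordered['state']))
# ['Alabama', 'Arizona', 'California', 'Alaska', 'Arkansas']

# sort_values returns a new DataFrame; the original order is unchanged
print(list(sample['state']))
# ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California']
```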


### Subsetting columns

Individual columns can be extracted as `Series` objects using square bracket indexing:

In [11]:
homelessness['individuals']

0       2570.0
1       1434.0
2       7259.0
3       2280.0
4     109008.0
        ...   
46      3928.0
47     16424.0
48      1021.0
49      2740.0
50       434.0
Name: individuals, Length: 51, dtype: float64

Several columns can be selected at once by indexing with a `list` of column names, which returns a new `DataFrame`:

In [12]:
homelessness[['state', 'family_members']]

Unnamed: 0,state,family_members
0,Alabama,864.0
1,Alaska,582.0
2,Arizona,2606.0
3,Arkansas,432.0
4,California,20964.0
...,...,...
46,Virginia,2047.0
47,Washington,5880.0
48,West Virginia,222.0
49,Wisconsin,2167.0
50,Wyoming,205.0


Note that a one-element `list` returns a `DataFrame` with a single column, rather than the `Series` returned by indexing with a bare column name:

In [13]:
homelessness[['state']]

Unnamed: 0,state
0,Alabama
1,Alaska
2,Arizona
3,Arkansas
4,California
...,...
46,Virginia
47,Washington
48,West Virginia
49,Wisconsin
50,Wyoming
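The difference between single and double square brackets is easy to check directly. The following sketch uses a three-row stand-in (the names `sample`, `as_series` and `as_frame` are hypothetical) and compares the type and shape of each result.

```python
import pandas as pd

# Hypothetical stand-in built from the first three rows shown earlier
sample = pd.DataFrame({
    'state': ['Alabama', 'Alaska', 'Arizona'],
    'family_members': [864.0, 582.0, 2606.0],
})

# A bare column name gives a one-dimensional Series
as_series = sample['state']
print(type(as_series).__name__, as_series.shape)  # Series (3,)

# A one-element list gives a two-dimensional DataFrame with one column
as_frame = sample[['state']]
print(type(as_frame).__name__, as_frame.shape)  # DataFrame (3, 1)
```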


### Subsetting rows

Rows of a `DataFrame` can be subsetted by indexing with a `Series` of `bool`s; only the rows where the value is `True` are kept. Here's a simple example:

In [14]:
homelessness[homelessness['individuals'] > 10_000]

Unnamed: 0,region,state,individuals,family_members,state_pop
4,Pacific,California,109008.0,20964.0,39461588
9,South Atlantic,Florida,21443.0,9587.0,21244317
32,Mid-Atlantic,New York,39827.0,52070.0,19530351
37,Pacific,Oregon,11139.0,3337.0,4181886
43,West South Central,Texas,19199.0,6111.0,28628666
47,Pacific,Washington,16424.0,5880.0,7523869


The same approach works with an equality comparison on a text column:

In [15]:
homelessness[homelessness['region'] == 'Mountain']

Unnamed: 0,region,state,individuals,family_members,state_pop
2,Mountain,Arizona,7259.0,2606.0,7158024
5,Mountain,Colorado,7607.0,3250.0,5691287
12,Mountain,Idaho,1297.0,715.0,1750536
26,Mountain,Montana,983.0,422.0,1060665
28,Mountain,Nevada,7058.0,486.0,3027341
31,Mountain,New Mexico,1949.0,602.0,2092741
44,Mountain,Utah,1904.0,972.0,3153550
50,Mountain,Wyoming,434.0,205.0,577601


Here's a more complicated example using two `Series` of `bool`s joined with a Boolean "and" operation. **Note that each condition must be wrapped in parentheses and that `&` must be used instead of `and`**:

In [16]:
homelessness[(homelessness['family_members'] < 1000) & (homelessness['region'] == 'Pacific')]

Unnamed: 0,region,state,individuals,family_members,state_pop
1,Pacific,Alaska,1434.0,582.0,735139
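The sketch below shows why `and` does not work here, using a five-row stand-in (the names `sample`, `small` and `pacific` are hypothetical). Python's `and` needs a single `True` or `False` from each operand, and pandas raises a `ValueError` when asked to reduce a whole `Series` to one truth value. The parentheses matter because `&` binds more tightly than `<` and `==`; naming the masks first, as done here, sidesteps that issue.

```python
import pandas as pd

# Hypothetical stand-in built from rows shown earlier in this section
sample = pd.DataFrame({
    'region': ['East South Central', 'Pacific', 'Mountain',
               'West South Central', 'Pacific'],
    'state': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California'],
    'family_members': [864.0, 582.0, 2606.0, 432.0, 20964.0],
})

small = sample['family_members'] < 1000
pacific = sample['region'] == 'Pacific'

# `and` tries to convert each Series to a single bool, which pandas rejects
try:
    sample[small and pacific]
    and_failed = False
except ValueError:
    and_failed = True
print(and_failed)  # True

# `&` combines the two masks element by element
both = sample[small & pacific]
print(list(both['state']))  # ['Alaska']

# `|` is the element-wise "or"
either = sample[small | pacific]
print(list(either['state']))  # ['Alabama', 'Alaska', 'Arkansas', 'California']
```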


### Subsetting rows by categorical variables

Multiple values of a categorical variable can be matched using the `.isin()` method of pandas `Series` objects:

In [17]:
# Subset the Mojave Desert states
homelessness[homelessness['state'].isin(['California', 'Arizona', 'Nevada', 'Utah'])]

Unnamed: 0,region,state,individuals,family_members,state_pop
2,Mountain,Arizona,7259.0,2606.0,7158024
4,Pacific,California,109008.0,20964.0,39461588
28,Mountain,Nevada,7058.0,486.0,3027341
44,Mountain,Utah,1904.0,972.0,3153550
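Because `.isin()` returns a `Series` of `bool`s, it can be stored, reused and negated with `~`. Here is a minimal sketch on a five-row stand-in (the names `sample`, `mojave` and `in_mojave` are hypothetical); Nevada and Utah are absent from the stand-in, so only two states match.

```python
import pandas as pd

# Hypothetical stand-in built from the five rows shown by .head() earlier
sample = pd.DataFrame({
    'region': ['East South Central', 'Pacific', 'Mountain',
               'West South Central', 'Pacific'],
    'state': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California'],
})

mojave = ['California', 'Arizona', 'Nevada', 'Utah']

# True where the state appears in the list, False elsewhere
in_mojave = sample['state'].isin(mojave)
print(list(sample[in_mojave]['state']))  # ['Arizona', 'California']

# ~ flips the mask, keeping the rows that are NOT in the list
print(list(sample[~in_mojave]['state']))  # ['Alabama', 'Alaska', 'Arkansas']
```

Compared with chaining several `==` comparisons with `|`, a single `.isin()` call stays readable as the list of values grows.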
