### Example Data Science Project

Let's use pandas to analyze data from the Utah Avalanche Center. 

In [4]:
import pandas as pd


df = pd.read_csv('ava-all.csv')
df.dtypes

Unnamed: 0                           int64
Accident and Rescue Summary:        object
Aspect:                             object
Avalanche Problem:                  object
Avalanche Type:                     object
Buried - Fully:                    float64
Buried - Partly:                   float64
Carried:                           float64
Caught:                            float64
Comments:                           object
Coordinates:                        object
Depth:                              object
Elevation:                          object
Injured:                           float64
Killed:                              int64
Location Name or Route:             object
Observation Date:                   object
Observer Name:                      object
Occurence Time:                     object
Occurrence Date:                    object
Region:                             object
Slope Angle:                       float64
Snow Profile Comments:              object
Terrain Sum

We can see that the dataset has many columns, containing numeric data as well as strings (with `dtype=object`). The date columns were read as strings rather than as dates, so we'll need to convert them.
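As a rough sketch of that date conversion, `pd.to_datetime` can parse a string column into datetimes. The sample values and the `%Y-%m-%d` format below are assumptions for illustration only; the real file may store dates differently, and `occurrence_date` is a hypothetical new column name.

```python
import pandas as pd

# Hypothetical sample values; the real file's date format may differ.
sample = pd.DataFrame(
    {'Occurrence Date:': ['2019-01-05', '2019-02-17', None, 'not a date']}
)

# errors='coerce' turns anything unparseable into NaT instead of raising.
sample['occurrence_date'] = pd.to_datetime(
    sample['Occurrence Date:'], format='%Y-%m-%d', errors='coerce'
)

print(sample.dtypes)
print(sample['occurrence_date'].isna().sum())  # how many entries failed to parse
```

Counting the `NaT` values afterwards is a quick way to see how many entries did not match the assumed format.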

### Describing Data

Let's check the size of the dataset:

In [5]:
df.shape

(92, 38)

Not bad: 92 rows and 38 columns. Next, let's grab some summary statistics:

In [6]:
df.describe()

| | Unnamed: 0 | Buried - Fully: | Buried - Partly: | Carried: | Caught: | Injured: | Killed: | Slope Angle: | Video: | killed |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 92.0 | 64.0 | 22.0 | 71.0 | 72.0 | 5.0 | 92.0 | 42.0 | 0.0 | 92.0 |
| mean | 45.5 | 1.15625 | 1.090909 | 1.591549 | 1.638889 | 1.0 | 1.163043 | 37.785714 | NaN | 1.163043 |
| std | 26.70206 | 0.365963 | 0.294245 | 1.049863 | 1.091653 | 0.0 | 0.47526 | 5.567921 | NaN | 0.47526 |
| min | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 10.0 | NaN | 1.0 |
| 25% | 22.75 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 36.0 | NaN | 1.0 |
| 50% | 45.5 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 38.0 | NaN | 1.0 |
| 75% | 68.25 | 1.0 | 1.0 | 2.0 | 2.0 | 1.0 | 1.0 | 40.0 | NaN | 1.0 |
| max | 91.0 | 2.0 | 2.0 | 7.0 | 7.0 | 1.0 | 4.0 | 50.0 | NaN | 4.0 |


Based on this, we already know that 64 avalanches had people fully buried. The average number of people fully buried was 1.16, with a minimum of 1 person and a maximum of 2. We can also tell that there are many `NaN` values, although we aren't sure what they mean. One way to check is to replace all the `NaN` values with 0, then generate the summary statistics again:

In [7]:
df.fillna(0).describe()

| | Unnamed: 0 | Buried - Fully: | Buried - Partly: | Carried: | Caught: | Injured: | Killed: | Slope Angle: | Video: | killed |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 92.0 | 92.0 | 92.0 | 92.0 | 92.0 | 92.0 | 92.0 | 92.0 | 92.0 | 92.0 |
| mean | 45.5 | 0.804348 | 0.26087 | 1.228261 | 1.282609 | 0.054348 | 1.163043 | 17.25 | 0.0 | 1.163043 |
| std | 26.70206 | 0.615534 | 0.488765 | 1.139725 | 1.179738 | 0.227945 | 0.47526 | 19.289936 | 0.0 | 0.47526 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 22.75 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 50% | 45.5 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 75% | 68.25 | 1.0 | 0.0 | 2.0 | 2.0 | 0.0 | 1.0 | 37.0 | 0.0 | 1.0 |
| max | 91.0 | 2.0 | 2.0 | 7.0 | 7.0 | 1.0 | 4.0 | 50.0 | 0.0 | 4.0 |


Now the average number of people fully buried has dropped to 0.8. This should prompt us to check with someone in the know about how to interpret the `NaN` values.
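To see why the mean moves, it helps to reproduce the arithmetic on a small stand-in. The series below is synthetic, not the real column: it is built only to match the count (64), minimum (1), and maximum (2) reported for the fully buried column, padded with 28 missing values to reach 92 rows.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in, not the real column: 54 ones, 10 twos, 28 missing values.
buried = pd.Series([1.0] * 54 + [2.0] * 10 + [np.nan] * 28)

print(buried.isna().sum())       # number of missing entries
print(buried.mean())             # NaN values are skipped: 74 / 64
print(buried.fillna(0).mean())   # NaN values count as 0: 74 / 92
```

By default `.mean()` skips `NaN`, so the divisor is the number of known values. After `.fillna(0)` the same total is divided by all 92 rows, which is only appropriate if a missing entry really means "nobody was buried".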

Now, we can clean up the appearance of the DataFrame by replacing all the colons in the column labels with nothing:

In [9]:
df = df.rename(columns={x: x.replace(':', '') for x in df.columns})

df.describe()

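| | Unnamed 0 | Buried - Fully | Buried - Partly | Carried | Caught | Injured | Killed | Slope Angle | Video | killed |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 92.0 | 64.0 | 22.0 | 71.0 | 72.0 | 5.0 | 92.0 | 42.0 | 0.0 | 92.0 |
| mean | 45.5 | 1.15625 | 1.090909 | 1.591549 | 1.638889 | 1.0 | 1.163043 | 37.785714 | NaN | 1.163043 |
| std | 26.70206 | 0.365963 | 0.294245 | 1.049863 | 1.091653 | 0.0 | 0.47526 | 5.567921 | NaN | 0.47526 |
| min | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 10.0 | NaN | 1.0 |
| 25% | 22.75 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 36.0 | NaN | 1.0 |
| 50% | 45.5 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 38.0 | NaN | 1.0 |
| 75% | 68.25 | 1.0 | 1.0 | 2.0 | 2.0 | 1.0 | 1.0 | 40.0 | NaN | 1.0 |
| max | 91.0 | 2.0 | 2.0 | 7.0 | 7.0 | 1.0 | 4.0 | 50.0 | NaN | 4.0 |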
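The same label cleanup can also be written with the vectorized string methods on the column index, which some readers find easier to scan than a dict comprehension. This is only an alternative sketch on a toy frame; the extra `.str.strip()` is an assumption in case any labels carry stray whitespace.

```python
import pandas as pd

# Toy frame reusing a few of the original labels.
toy = pd.DataFrame(columns=['Buried - Fully:', 'Slope Angle:', 'killed'])

# regex=False treats ':' as a literal character rather than a pattern.
toy.columns = toy.columns.str.replace(':', '', regex=False).str.strip()

print(list(toy.columns))
```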


### Categorical Data

There are many columns that didn't appear in the `.describe()` output table because they do not contain numerical values. These categorical data are important and we can inspect them with the `.value_counts()` method:

In [11]:
df.loc[:, 'Aspect'].value_counts() # Recall that this generates a text-based "histogram" of sorts

Northeast    24
North        14
East          9
Northwest     9
Southeast     3
West          3
South         1
Name: Aspect, dtype: int64

In [12]:
df.loc[:, 'Avalanche Type'].value_counts()

Hard Slab       27
Soft Slab       24
Cornice Fall     1
Wet Slab         1
Name: Avalanche Type, dtype: int64

It seems that there are missing values in these categorical columns as well. We can confirm this by summing the counts:

In [14]:
df.loc[:, 'Avalanche Type'].value_counts().sum()

53
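Since the DataFrame has 92 rows, a total of 53 leaves 39 entries unaccounted for. A more direct way to surface them, sketched here on toy values rather than the real column, is to count the missing entries with `.isna()` or to ask `.value_counts()` to include them:

```python
import numpy as np
import pandas as pd

# Toy values, not the real column.
avalanche_type = pd.Series(['Hard Slab', 'Soft Slab', np.nan, 'Hard Slab', np.nan])

missing = avalanche_type.isna().sum()                        # count of NaN entries
counts_with_nan = avalanche_type.value_counts(dropna=False)  # NaN gets its own row

print(missing)
print(counts_with_nan)
```

With `dropna=False` the counts add up to the full length of the column, so nothing is hidden.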

### Converting Column Types

The "Depth" column should have been numeric but didn't show up in the `.describe()` table. Why is that?

In [15]:
df.loc[:, 'Depth'].head(15)

0       3'
1       4'
2       4'
3      18"
4       8"
5       2'
6       3'
7       2'
8      16"
9       3'
10    2.5'
11     16"
12     NaN
13    3.5'
14      8'
Name: Depth, dtype: object

So this is why: the feet (') and inch (") symbols in the entries prevent pandas from recognizing these values as numeric. A good way to deal with data this messy is with a regular expression:

In [16]:
import re


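def to_inches(orig):
    """Convert a depth string such as 3' or 18" to total inches."""
    txt = str(orig)
    if txt == 'nan':  # missing value: pass it through unchanged
        return orig
    # Optional feet (number followed by '), then optional inches (number followed by ")
    reg = r'''(((\d*\.)?\d*)')?(((\d*\.)?\d*)")?'''
    mo = re.search(reg, txt)
    feet = mo.group(2) or 0
    inches = mo.group(5) or 0
    return float(feet) * 12 + float(inches)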

The above function returns `NaN` if that is what's in the depth entry. Otherwise it looks for optional feet (a number followed by ') and optional inches (a number followed by "), then converts the result into total inches.
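One thing to be aware of is that `to_inches` returns 0.0 for text it cannot parse (for example a word like "Unknown"), because the pattern can match an empty string. If treating such entries as missing is preferable, a stricter variant might look like the sketch below. The names `to_inches_strict` and `DEPTH_RE` are hypothetical, and returning `NaN` for unparseable text is a design assumption, not the notebook's behavior.

```python
import math
import re

# Optional feet (number + '), optional whitespace, optional inches (number + ").
DEPTH_RE = re.compile(
    r'''\s*(?:(?P<feet>\d+(?:\.\d+)?)')?\s*(?:(?P<inches>\d+(?:\.\d+)?)")?\s*'''
)


def to_inches_strict(orig):
    """Return total inches, or NaN when the text is not a feet/inches value."""
    mo = DEPTH_RE.fullmatch(str(orig))
    # Reject text that does not match, or that has neither feet nor inches.
    if mo is None or (mo.group('feet') is None and mo.group('inches') is None):
        return float('nan')
    feet = float(mo.group('feet') or 0)
    inches = float(mo.group('inches') or 0)
    return feet * 12 + inches


for text in ["3'", '18"', "2.5'", "3' 6\"", 'Unknown']:
    print(repr(text), to_inches_strict(text))
```

Using `fullmatch` means the whole entry has to look like a measurement, so stray text is flagged as missing instead of silently becoming a depth of zero.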

In [17]:
df.loc[:, 'depth_inches'] = df.loc[:, 'Depth'].apply(to_inches)

df.loc[:, 'depth_inches'].describe()

count    61.000000
mean     32.573770
std      17.628064
min       0.000000
25%      24.000000
50%      30.000000
75%      42.000000
max      96.000000
Name: depth_inches, dtype: float64

We're almost there, but 31 of the 92 values are still missing. Let's fill them with the median depth:

In [18]:
df.loc[:, 'depth_inches'] = df.loc[:, 'depth_inches'].fillna(df.loc[:, 'depth_inches'].median())

df.loc[:, 'depth_inches'].describe()

count    92.000000
mean     31.706522
std      14.366122
min       0.000000
25%      24.000000
50%      30.000000
75%      36.000000
max      96.000000
Name: depth_inches, dtype: float64
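The shift in the mean can be checked by hand from the two summaries: 61 known values averaging about 32.57 inches, plus the remaining rows each set to the median of 30. The short calculation below uses only numbers from the outputs above; the variable names are just for illustration.

```python
# Figures taken from the two .describe() outputs above.
n_total, n_known = 92, 61
mean_before, median = 32.573770, 30.0

n_filled = n_total - n_known  # rows that received the median
mean_after = (n_known * mean_before + n_filled * median) / n_total

print(n_filled, round(mean_after, 4))
```

The result agrees with the reported mean of 31.706522. The standard deviation also shrinks (17.63 to 14.37), because every filled row sits exactly at the median; that narrowing is a side effect of imputation, not a property of the snowpack.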

Another column that pandas misinterpreted as non-numeric is the "Vertical" column:

In [19]:
df.loc[:, 'Vertical'].head(15)

0        1500
1         200
2         175
3         125
4        1500
5         250
6          50
7        1000
8         600
9         350
10       2500
11        800
12        900
13    Unknown
14       1000
Name: Vertical, dtype: object

This one is much easier to fix: pandas is treating the whole column as strings because of the "Unknown" entries. We can use the `pd.to_numeric()` function with `errors='coerce'` to convert every "Unknown" to `NaN`:

In [21]:
df.loc[:, 'vert'] = pd.to_numeric(df.loc[:, 'Vertical'], errors='coerce')

df.loc[:, 'vert']

0     1500.0
1      200.0
2      175.0
3      125.0
4     1500.0
       ...  
87       NaN
88    1500.0
89     300.0
90    1250.0
91    1250.0
Name: vert, Length: 92, dtype: float64
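Because `errors='coerce'` silently turns anything unparseable into `NaN`, it can be worth counting how many entries were affected. A minimal sketch on toy values (not the real column):

```python
import pandas as pd

# Toy values, not the real column.
vertical = pd.Series(['1500', '200', 'Unknown', '1000'])
vert = pd.to_numeric(vertical, errors='coerce')

# Entries that had some text originally but are NaN after conversion.
n_coerced = (vert.isna() & vertical.notna()).sum()

print(vert.tolist(), n_coerced)
```

Comparing against the original column separates values that were coerced from values that were already missing.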