# Introduction To Pandas Tutorial

Author: Vi Ly  
Date: 21 Mar 2021  
LinkedIn: https://www.linkedin.com/in/vi-ly-2810ba59/

The easiest way to install Python and pandas is through Anaconda: https://www.anaconda.com/products/individual.

pandas is an extensive package, and this tutorial covers only a portion of its capabilities.  For further reading, refer to the documentation: https://pandas.pydata.org/docs/reference/index.html

This tutorial will use the **Auto MPG** Dataset from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/auto+mpg

## Table of Contents

1. [Importing Pandas](#import)
2. [Reading in Data](#read)
3. [Basic Functionality](#basic)
4. [Accessing Columns](#access)
5. [Descriptive Statistics](#sumstats)
6. [Sorting](#sort)
7. [Comparison Operators](#compare)
8. [Dealing with Missing Data](#missing)
9. [Dealing with Duplicates](#duplicate)
10. [Math Operations](#math_ops)
11. [A Word About Indices](#index)
12. [Functions in Python](#functions)
13. [Groupby](#groupby)
14. [Merging](#merge)
15. [One-Hot Encoding](#onehot)
16. [Mapping](#mapping)
17. [Subsetting](#subset)
18. [Concatenating DataFrames](#concat)
19. [Lagging Variables](#lag)
20. [Rolling Functions](#rolling)
21. [String Methods](#string)
22. [Data Conversion](#data_conversion)
23. [Exporting Data](#export)

## Importing Pandas <a name="import"></a>

It is standard convention to alias **pandas** as **pd**.

In [1]:
import pandas as pd

Import additional libraries.

In [2]:
import os

import numpy as np

## Reading in Data <a name="read"></a>

Python variable naming rules:
- Can contain only letters, numbers, and underscores
- Must start with a letter or an underscore; cannot start with a number
- Are case-sensitive

Use the **pd.read_csv()** function to read a CSV file into a DataFrame, and assign the result to the variable mpg_df.

pandas offers many functions for reading other file types (for example, **pd.read_excel()** and **pd.read_json()**).  Refer to the documentation.
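If the dataset file is not at hand, the same function can be tried on in-memory text.  The sketch below is illustrative only: the CSV text and the `sample_df` name are made up, and only three of the dataset's columns are imitated.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """mpg,cylinders,car_name
18.0,8,chevrolet chevelle malibu
,4,citroen ds-21 pallas
24.0,4,toyota corona mark ii
"""

# pd.read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)
print(sample_df.dtypes)
```

The empty mpg field in the second row is read in as a missing value (NaN), which is how missing data appears later in this tutorial.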

In [3]:
mpg_df = pd.read_csv(f'c:/users/{os.getlogin()}/desktop/mpg dataset.csv')

## Basic Functionality <a name="basic"></a>

Get the column names by accessing the **.columns** attribute.

In [4]:
mpg_df.columns

Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'model_year', 'origin', 'car_name'],
      dtype='object')

Get the DataFrame dimensions (rows, columns) by accessing the **.shape** attribute.

In [5]:
mpg_df.shape

(406, 9)

You can also get the number of rows and columns using the **len()** function.  
Get the number of columns in the DataFrame.

In [6]:
len(mpg_df.columns)

9

Get the number of rows in the DataFrame.

In [7]:
len(mpg_df)

406

The **.head()** method, by default, returns the first 5 rows in the DataFrame.

In [8]:
mpg_df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,1,ford torino


Passing an integer n to the **.head()** method returns the first n rows.

In [9]:
mpg_df.head(10)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,1,ford torino
5,15.0,8,429.0,198.0,4341,10.0,70,1,ford galaxie 500
6,14.0,8,454.0,220.0,4354,9.0,70,1,chevrolet impala
7,14.0,8,440.0,215.0,4312,8.5,70,1,plymouth fury iii
8,14.0,8,455.0,225.0,4425,10.0,70,1,pontiac catalina
9,15.0,8,390.0,190.0,3850,8.5,70,1,amc ambassador dpl


The **.tail()** method, by default, returns the last 5 rows in the DataFrame.

In [10]:
mpg_df.tail()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
401,27.0,4,140.0,86.0,2790,15.6,82,1,ford mustang gl
402,44.0,4,97.0,52.0,2130,24.6,82,2,vw pickup
403,32.0,4,135.0,84.0,2295,11.6,82,1,dodge rampage
404,28.0,4,120.0,79.0,2625,18.6,82,1,ford ranger
405,31.0,4,119.0,82.0,2720,19.4,82,1,chevy s-10


Similarly, passing an integer n to the **.tail()** method returns the last n rows.

In [11]:
mpg_df.tail(10)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
396,26.0,4,156.0,92.0,2585,14.5,82,1,chrysler lebaron medallion
397,22.0,6,232.0,112.0,2835,14.7,82,1,ford granada l
398,32.0,4,144.0,96.0,2665,13.9,82,3,toyota celica gt
399,36.0,4,135.0,84.0,2370,13.0,82,1,dodge charger 2.2
400,27.0,4,151.0,90.0,2950,17.3,82,1,chevrolet camaro
401,27.0,4,140.0,86.0,2790,15.6,82,1,ford mustang gl
402,44.0,4,97.0,52.0,2130,24.6,82,2,vw pickup
403,32.0,4,135.0,84.0,2295,11.6,82,1,dodge rampage
404,28.0,4,120.0,79.0,2625,18.6,82,1,ford ranger
405,31.0,4,119.0,82.0,2720,19.4,82,1,chevy s-10


The **help()** function provides information on the function / method.

In [12]:
help(pd.DataFrame.head)

Help on function head in module pandas.core.generic:

head(self: ~FrameOrSeries, n: int = 5) -> ~FrameOrSeries
    Return the first `n` rows.
    
    This function returns the first `n` rows for the object based
    on position. It is useful for quickly testing if your object
    has the right type of data in it.
    
    For negative values of `n`, this function returns all rows except
    the last `n` rows, equivalent to ``df[:-n]``.
    
    Parameters
    ----------
    n : int, default 5
        Number of rows to select.
    
    Returns
    -------
    same type as caller
        The first `n` rows of the caller object.
    
    See Also
    --------
    DataFrame.tail: Returns the last `n` rows.
    
    Examples
    --------
    >>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
    ...                    'monkey', 'parrot', 'shark', 'whale', 'zebra']})
    >>> df
          animal
    0  alligator
    1        bee
    2     falcon
    3       lion
    4   

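The **.info()** method provides basic information on the DataFrame such as:
- number of rows
- number of columns
- column names
- data type of each column
- number of non-null values per column (from which the number of missing values follows)
- approximate memory usage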

In [13]:
mpg_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406 entries, 0 to 405
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     406 non-null    int64  
 2   displacement  406 non-null    float64
 3   horsepower    400 non-null    float64
 4   weight        406 non-null    int64  
 5   acceleration  406 non-null    float64
 6   model_year    406 non-null    int64  
 7   origin        406 non-null    int64  
 8   car_name      406 non-null    object 
dtypes: float64(4), int64(4), object(1)
memory usage: 28.7+ KB


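The **.rename()** method allows you to rename the index or column names.  It returns a new DataFrame; mpg_df itself is unchanged unless you assign the result back.  (The output below was produced after later cells had run, so it also shows columns that are added further on in this tutorial.)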

In [117]:
mpg_df.rename(columns={'mpg': 'miles_per_gallon', 'horsepower': 'hp'})

Unnamed: 0,miles_per_gallon,cylinders,displacement,hp,weight,acceleration,model_year,origin,car_name,mpg_fill_0,mpg_fill_mean,weight_standardized,weight_standardized_from_func,origin_1,origin_2,origin_3,origin_str
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu,18.0,18.0,0.619343,0.619343,1,0,0,US
1,15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320,15.0,15.0,0.842482,0.842482,1,0,0,US
2,18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite,18.0,18.0,0.539060,0.539060,1,0,0,US
3,16.0,8,304.0,150.0,3433,12.0,70,1,amc rebel sst,16.0,16.0,0.535518,0.535518,1,0,0,US
4,17.0,8,302.0,140.0,3449,10.5,70,1,ford torino,17.0,17.0,0.554408,0.554408,1,0,0,US
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
401,27.0,4,140.0,86.0,2790,15.6,82,1,ford mustang gl,27.0,27.0,-0.223628,-0.223628,1,0,0,US
402,44.0,4,97.0,52.0,2130,24.6,82,2,vw pickup,44.0,44.0,-1.002845,-1.002845,0,1,0,Europe
403,32.0,4,135.0,84.0,2295,11.6,82,1,dodge rampage,32.0,32.0,-0.808040,-0.808040,1,0,0,US
404,28.0,4,120.0,79.0,2625,18.6,82,1,ford ranger,28.0,28.0,-0.418432,-0.418432,1,0,0,US
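A minimal sketch on a small made-up DataFrame (the `demo_df` name and its values are illustrative, not taken from the dataset) that checks whether **.rename()** modifies the original:

```python
import pandas as pd

demo_df = pd.DataFrame({'mpg': [18.0, 15.0], 'horsepower': [130.0, 165.0]})

# .rename() returns a new DataFrame; demo_df itself is left unchanged.
renamed_df = demo_df.rename(columns={'mpg': 'miles_per_gallon', 'horsepower': 'hp'})

print(list(demo_df.columns))
print(list(renamed_df.columns))
```

To keep the new names, assign the result back to a variable, as done with `renamed_df` here.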


## Accessing Columns<a name="access"></a>

You can access a column in the DataFrame using **bracket** notation or **dot** notation.  The recommended approach is to use **bracket** notation.

#### Bracket Notation

In [14]:
mpg_df['cylinders']

0      8
1      8
2      8
3      8
4      8
      ..
401    4
402    4
403    4
404    4
405    4
Name: cylinders, Length: 406, dtype: int64

#### Dot Notation

In [15]:
mpg_df.cylinders

0      8
1      8
2      8
3      8
4      8
      ..
401    4
402    4
403    4
404    4
405    4
Name: cylinders, Length: 406, dtype: int64
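One reason bracket notation is usually the safer choice: dot notation stops working when a column name clashes with an existing DataFrame attribute or contains a space.  The sketch below uses made-up column names chosen to show both cases.

```python
import pandas as pd

# Made-up columns chosen to show where dot notation breaks down.
demo_df = pd.DataFrame({'count': [3, 5], 'model year': [70, 82]})

# 'count' is also a DataFrame method, so dot access returns the method, not the column.
print(callable(demo_df.count))

# Bracket notation always returns the column.
print(demo_df['count'].tolist())

# Names containing spaces cannot be written with dot notation at all.
print(demo_df['model year'].tolist())
```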

Individual columns in a DataFrame are referred to as Series.

In [16]:
type(mpg_df['cylinders'])

pandas.core.series.Series

In [17]:
type(mpg_df)

pandas.core.frame.DataFrame

Selecting several columns requires **bracket** notation.  Note that this syntax uses double square brackets (a list of column names inside the indexing brackets) and returns a DataFrame instead of a Series.

In [18]:
mpg_df[['mpg', 'cylinders', 'displacement']]

Unnamed: 0,mpg,cylinders,displacement
0,18.0,8,307.0
1,15.0,8,350.0
2,18.0,8,318.0
3,16.0,8,304.0
4,17.0,8,302.0
...,...,...,...
401,27.0,4,140.0
402,44.0,4,97.0
403,32.0,4,135.0
404,28.0,4,120.0


## Descriptive Statistics<a name="sumstats"></a>

The **.value_counts()** method provides the counts for each unique value in the column.  The output will be in descending order with largest count first.

In [19]:
mpg_df['cylinders'].value_counts()

4    207
8    108
6     84
3      4
5      3
Name: cylinders, dtype: int64

When the argument **normalize=True** is provided to the **.value_counts()** method, the output will be proportions (fractions that sum to 1) instead of counts.

Note: **True** and its counterpart **False** are reserved keywords in Python.

In [20]:
mpg_df['cylinders'].value_counts(normalize=True)

4    0.509852
8    0.266010
6    0.206897
3    0.009852
5    0.007389
Name: cylinders, dtype: float64

The **.describe()** method provides basic summary statistics including:
- count
- mean
- min
- max
- std dev
- 25th, 50th, 75th percentiles

In [21]:
mpg_df.describe()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
count,398.0,406.0,406.0,400.0,406.0,406.0,406.0,406.0
mean,23.514573,5.475369,194.779557,105.0825,2979.413793,15.519704,75.921182,1.568966
std,7.815984,1.71216,104.922458,38.768779,847.004328,2.803359,3.748737,0.797479
min,9.0,3.0,68.0,46.0,1613.0,8.0,70.0,1.0
25%,17.5,4.0,105.0,75.75,2226.5,13.7,73.0,1.0
50%,23.0,4.0,151.0,95.0,2822.5,15.5,76.0,1.0
75%,29.0,8.0,302.0,130.0,3618.25,17.175,79.0,2.0
max,46.6,8.0,455.0,230.0,5140.0,24.8,82.0,3.0


Providing a sequence of decimals to the **.describe()** method provides the specified percentiles and overrides the default percentiles.

In [22]:
mpg_df.describe([.1, .2, .3, .4, .5, .6, .7, .8, .9, 1])

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
count,398.0,406.0,406.0,400.0,406.0,406.0,406.0,406.0
mean,23.514573,5.475369,194.779557,105.0825,2979.413793,15.519704,75.921182,1.568966
std,7.815984,1.71216,104.922458,38.768779,847.004328,2.803359,3.748737,0.797479
min,9.0,3.0,68.0,46.0,1613.0,8.0,70.0,1.0
10%,14.0,4.0,90.0,67.0,1987.5,12.0,71.0,1.0
20%,16.0,4.0,98.0,72.0,2155.0,13.2,72.0,1.0
30%,18.0,4.0,112.0,80.7,2315.0,14.0,73.0,1.0
40%,20.0,4.0,122.0,88.0,2587.0,14.8,75.0,1.0
50%,23.0,4.0,151.0,95.0,2822.5,15.5,76.0,1.0
60%,25.0,6.0,225.0,100.0,3102.0,16.0,77.0,1.0


Get the column mean with the **.mean()** method.

In [23]:
mpg_df['mpg'].mean()

23.514572864321615

Get the column min with the **.min()** method.

In [24]:
mpg_df['mpg'].min()

9.0

Get the column max with the **.max()** method.

In [25]:
mpg_df['mpg'].max()

46.6

Get the column std dev with the **.std()** method.

In [26]:
mpg_df['mpg'].std()

7.815984312565782

Get the column quantile with the **.quantile()** method.

In [27]:
mpg_df['mpg'].quantile(.3)

18.0

Get the column total with the **.sum()** method.

In [28]:
mpg_df['weight'].sum()

1209642

Some methods can be applied to multiple columns at once.

In [29]:
mpg_df[['mpg', 'displacement', 'horsepower']].min()

mpg              9.0
displacement    68.0
horsepower      46.0
dtype: float64

Additionally, they can be applied to the entire DataFrame.

In [30]:
mpg_df.min()

mpg                                   9
cylinders                             3
displacement                         68
horsepower                           46
weight                             1613
acceleration                          8
model_year                           70
origin                                1
car_name        amc ambassador brougham
dtype: object

Some methods can also be applied row-wise, by using the **axis=1** argument.  The example below calculates the row-wise minimum of mpg, displacement, and horsepower columns.

In [31]:
mpg_df[['mpg', 'displacement', 'horsepower']].min(axis=1)

0      18.0
1      15.0
2      18.0
3      16.0
4      17.0
       ... 
401    27.0
402    44.0
403    32.0
404    28.0
405    31.0
Length: 406, dtype: float64

The **.corr()** method provides the correlation matrix as a DataFrame.

In [32]:
mpg_df.corr()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
mpg,1.0,-0.775396,-0.804203,-0.778427,-0.831741,0.420289,0.579267,0.56345
cylinders,-0.775396,1.0,0.951787,0.844158,0.89522,-0.522452,-0.360762,-0.567478
displacement,-0.804203,0.951787,1.0,0.898326,0.932475,-0.557984,-0.381714,-0.613056
horsepower,-0.778427,0.844158,0.898326,1.0,0.866586,-0.697124,-0.424419,-0.460033
weight,-0.831741,0.89522,0.932475,0.866586,1.0,-0.430086,-0.315389,-0.584109
acceleration,0.420289,-0.522452,-0.557984,-0.697124,-0.430086,1.0,0.301992,0.218845
model_year,0.579267,-0.360762,-0.381714,-0.424419,-0.315389,0.301992,1.0,0.187656
origin,0.56345,-0.567478,-0.613056,-0.460033,-0.584109,0.218845,0.187656,1.0
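In recent pandas releases (2.0 and later, to our knowledge), **.corr()** no longer silently drops non-numeric columns such as car_name and raises an error instead.  One way to get the same matrix in any version is to select the numeric columns first.  The sketch below uses made-up data; `demo_df` and `corr_df` are illustrative names.

```python
import pandas as pd

# Made-up data: two numeric columns and one text column.
demo_df = pd.DataFrame({
    'x': [1, 2, 3, 4],
    'y': [2, 4, 6, 8],
    'name': ['a', 'b', 'c', 'd'],
})

# Keep only the numeric columns before computing correlations.
corr_df = demo_df.select_dtypes(include='number').corr()

print(corr_df)
```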


## Sorting<a name="sort"></a>

Sorting is done using the **.sort_values()** method.

In [33]:
mpg_df.head(10)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,1,ford torino
5,15.0,8,429.0,198.0,4341,10.0,70,1,ford galaxie 500
6,14.0,8,454.0,220.0,4354,9.0,70,1,chevrolet impala
7,14.0,8,440.0,215.0,4312,8.5,70,1,plymouth fury iii
8,14.0,8,455.0,225.0,4425,10.0,70,1,pontiac catalina
9,15.0,8,390.0,190.0,3850,8.5,70,1,amc ambassador dpl


Sort the DataFrame by the displacement column.  By default, sorting is done in ascending order.

In [34]:
displacement_ascending_df = mpg_df.sort_values(['displacement'])
displacement_ascending_df.head(10)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
124,29.0,4,68.0,49.0,1867,19.5,73,2,fiat 128
341,23.7,3,70.0,100.0,2420,12.5,80,3,mazda rx-7 gs
78,19.0,3,70.0,97.0,2330,13.5,72,3,mazda rx2 coupe
118,18.0,3,70.0,90.0,2124,13.5,73,3,maxda rx3
60,31.0,4,71.0,65.0,1773,19.0,71,3,toyota corolla 1200
138,32.0,4,71.0,65.0,1836,21.0,74,3,toyota corolla 1200
61,35.0,4,72.0,69.0,1613,18.0,71,3,datsun 1200
151,31.0,4,76.0,52.0,1649,16.5,74,3,toyota corona
253,32.8,4,78.0,52.0,1985,19.4,78,3,mazda glc deluxe
350,39.1,4,79.0,58.0,1755,16.9,81,3,toyota starlet


Sort the DataFrame by the displacement column in descending order, by using the ascending argument.

In [35]:
displacement_descending_df = mpg_df.sort_values(['displacement'], ascending=[False])
displacement_descending_df.head(10)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
19,14.0,8,455.0,225.0,3086,10.0,70,1,buick estate wagon (sw)
8,14.0,8,455.0,225.0,4425,10.0,70,1,pontiac catalina
102,12.0,8,455.0,225.0,4951,11.0,73,1,buick electra 225 custom
6,14.0,8,454.0,220.0,4354,9.0,70,1,chevrolet impala
101,13.0,8,440.0,215.0,4735,11.0,73,1,chrysler new yorker brougham
7,14.0,8,440.0,215.0,4312,8.5,70,1,plymouth fury iii
74,11.0,8,429.0,208.0,4633,11.0,72,1,mercury marquis
97,12.0,8,429.0,198.0,4952,11.5,73,1,mercury marquis brougham
5,15.0,8,429.0,198.0,4341,10.0,70,1,ford galaxie 500
98,13.0,8,400.0,150.0,4464,12.0,73,1,chevrolet caprice classic


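Sort by multiple columns by passing lists.  Here displacement is sorted in descending order, and ties are broken by mpg in ascending order.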

In [36]:
multi_sort_df = mpg_df.sort_values(['displacement', 'mpg'], ascending=[False, True])
multi_sort_df.head(10)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
102,12.0,8,455.0,225.0,4951,11.0,73,1,buick electra 225 custom
8,14.0,8,455.0,225.0,4425,10.0,70,1,pontiac catalina
19,14.0,8,455.0,225.0,3086,10.0,70,1,buick estate wagon (sw)
6,14.0,8,454.0,220.0,4354,9.0,70,1,chevrolet impala
101,13.0,8,440.0,215.0,4735,11.0,73,1,chrysler new yorker brougham
7,14.0,8,440.0,215.0,4312,8.5,70,1,plymouth fury iii
74,11.0,8,429.0,208.0,4633,11.0,72,1,mercury marquis
97,12.0,8,429.0,198.0,4952,11.5,73,1,mercury marquis brougham
5,15.0,8,429.0,198.0,4341,10.0,70,1,ford galaxie 500
110,11.0,8,400.0,150.0,4997,14.0,73,1,chevrolet impala


## Comparison Operators<a name="compare"></a>

- **>=** : greater than or equal to
- **>** : greater than
- **<=** : less than or equal to
- **<** : less than
- **==** : equals
- **!=** : not equals

In [37]:
mpg_df['cylinders'] == 8

0       True
1       True
2       True
3       True
4       True
       ...  
401    False
402    False
403    False
404    False
405    False
Name: cylinders, Length: 406, dtype: bool

In [38]:
(mpg_df['cylinders'] != 8).sum()

298

In [39]:
mpg_df['mpg'] >= mpg_df['acceleration']

0      True
1      True
2      True
3      True
4      True
       ... 
401    True
402    True
403    True
404    True
405    True
Length: 406, dtype: bool

In [40]:
(mpg_df['mpg'] >= mpg_df['acceleration']) & (mpg_df['cylinders'] >= 4)

0      True
1      True
2      True
3      True
4      True
       ... 
401    True
402    True
403    True
404    True
405    True
Length: 406, dtype: bool

In [41]:
(mpg_df['mpg'] >= mpg_df['acceleration']) | (mpg_df['cylinders'] >= 4)

0      True
1      True
2      True
3      True
4      True
       ... 
401    True
402    True
403    True
404    True
405    True
Length: 406, dtype: bool

In [42]:
mpg_df['cylinders'].isin([4, 8])

0      True
1      True
2      True
3      True
4      True
       ... 
401    True
402    True
403    True
404    True
405    True
Name: cylinders, Length: 406, dtype: bool

In [43]:
~mpg_df['cylinders'].isin([4, 8])

0      False
1      False
2      False
3      False
4      False
       ...  
401    False
402    False
403    False
404    False
405    False
Name: cylinders, Length: 406, dtype: bool

## Dealing With Missing Values<a name="missing"></a>

In [44]:
mpg_df['mpg'].isna()

0      False
1      False
2      False
3      False
4      False
       ...  
401    False
402    False
403    False
404    False
405    False
Name: mpg, Length: 406, dtype: bool

In [45]:
mpg_df['mpg'].isna().sum()

8

In [46]:
mpg_df['horsepower'].isna().sum()

6

In [48]:
mpg_df.isna()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
0,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...
401,False,False,False,False,False,False,False,False,False
402,False,False,False,False,False,False,False,False,False
403,False,False,False,False,False,False,False,False,False
404,False,False,False,False,False,False,False,False,False


Check the number of missing values column-wise.

In [49]:
mpg_df.isna().sum()

mpg             8
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model_year      0
origin          0
car_name        0
dtype: int64

Check the number of missing values row-wise.

In [50]:
mpg_df.isna().sum(axis=1)

0      0
1      0
2      0
3      0
4      0
      ..
401    0
402    0
403    0
404    0
405    0
Length: 406, dtype: int64

Fill missing values with 0.

In [51]:
mpg_df['mpg_fill_0'] = mpg_df['mpg'].fillna(0)

In [52]:
mpg_df['mpg_fill_0'].isna().sum()

0

In [53]:
mpg_df[['mpg', 'mpg_fill_0']].describe()

Unnamed: 0,mpg,mpg_fill_0
count,398.0,406.0
mean,23.514573,23.051232
std,7.815984,8.401777
min,9.0,0.0
25%,17.5,17.0
50%,23.0,22.35
75%,29.0,29.0
max,46.6,46.6


Fill missing values with the column mean.

In [54]:
mpg_df['mpg_fill_mean'] = mpg_df['mpg'].fillna(mpg_df['mpg'].mean())

The **.fillna()** method also allows you to fill forward (method='ffill') or backward (method='bfill').
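A small sketch of forward and backward filling on a made-up Series.  It uses the **.ffill()** and **.bfill()** methods, which fill in the same way as `fillna(method='ffill')` and `fillna(method='bfill')`; newer pandas versions (2.1 and later, to our knowledge) deprecate the `method` argument in favor of these two methods.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: carry the last valid value forward.
forward = s.ffill()

# Backward fill: pull the next valid value backward.
# The trailing NaN has no later value, so it stays missing.
backward = s.bfill()

print(forward.tolist())
print(backward.tolist())
```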

In [55]:
mpg_df.iloc[list(range(9, 18)) + [38, 39, 366, 367]]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name,mpg_fill_0,mpg_fill_mean
9,15.0,8,390.0,190.0,3850,8.5,70,1,amc ambassador dpl,15.0,15.0
10,,4,133.0,115.0,3090,17.5,70,2,citroen ds-21 pallas,0.0,23.514573
11,,8,350.0,165.0,4142,11.5,70,1,chevrolet chevelle concours (sw),0.0,23.514573
12,,8,351.0,153.0,4034,11.0,70,1,ford torino (sw),0.0,23.514573
13,,8,383.0,175.0,4166,10.5,70,1,plymouth satellite (sw),0.0,23.514573
14,,8,360.0,175.0,3850,11.0,70,1,amc rebel sst (sw),0.0,23.514573
15,15.0,8,383.0,170.0,3563,10.0,70,1,dodge challenger se,15.0,15.0
16,14.0,8,340.0,160.0,3609,8.0,70,1,plymouth 'cuda 340,14.0,14.0
17,,8,302.0,140.0,3353,8.0,70,1,ford mustang boss 302,0.0,23.514573
38,25.0,4,98.0,,2046,19.0,71,1,ford pinto,25.0,25.0


In [56]:
mpg_df.iloc[list(range(9, 18)) + [38, 39, 366, 367]].fillna(method='ffill')

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name,mpg_fill_0,mpg_fill_mean
9,15.0,8,390.0,190.0,3850,8.5,70,1,amc ambassador dpl,15.0,15.0
10,15.0,4,133.0,115.0,3090,17.5,70,2,citroen ds-21 pallas,0.0,23.514573
11,15.0,8,350.0,165.0,4142,11.5,70,1,chevrolet chevelle concours (sw),0.0,23.514573
12,15.0,8,351.0,153.0,4034,11.0,70,1,ford torino (sw),0.0,23.514573
13,15.0,8,383.0,175.0,4166,10.5,70,1,plymouth satellite (sw),0.0,23.514573
14,15.0,8,360.0,175.0,3850,11.0,70,1,amc rebel sst (sw),0.0,23.514573
15,15.0,8,383.0,170.0,3563,10.0,70,1,dodge challenger se,15.0,15.0
16,14.0,8,340.0,160.0,3609,8.0,70,1,plymouth 'cuda 340,14.0,14.0
17,14.0,8,302.0,140.0,3353,8.0,70,1,ford mustang boss 302,0.0,23.514573
38,25.0,4,98.0,140.0,2046,19.0,71,1,ford pinto,25.0,25.0


In [57]:
mpg_df.iloc[list(range(9, 18)) + [38, 39, 366, 367, 368]].fillna(method='bfill')

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name,mpg_fill_0,mpg_fill_mean
9,15.0,8,390.0,190.0,3850,8.5,70,1,amc ambassador dpl,15.0,15.0
10,15.0,4,133.0,115.0,3090,17.5,70,2,citroen ds-21 pallas,0.0,23.514573
11,15.0,8,350.0,165.0,4142,11.5,70,1,chevrolet chevelle concours (sw),0.0,23.514573
12,15.0,8,351.0,153.0,4034,11.0,70,1,ford torino (sw),0.0,23.514573
13,15.0,8,383.0,175.0,4166,10.5,70,1,plymouth satellite (sw),0.0,23.514573
14,15.0,8,360.0,175.0,3850,11.0,70,1,amc rebel sst (sw),0.0,23.514573
15,15.0,8,383.0,170.0,3563,10.0,70,1,dodge challenger se,15.0,15.0
16,14.0,8,340.0,160.0,3609,8.0,70,1,plymouth 'cuda 340,14.0,14.0
17,25.0,8,302.0,140.0,3353,8.0,70,1,ford mustang boss 302,0.0,23.514573
38,25.0,4,98.0,48.0,2046,19.0,71,1,ford pinto,25.0,25.0


The **.dropna()** method, by default, drops rows with any missing values.

In [58]:
mpg_dropna_df = mpg_df.dropna()

In [59]:
len(mpg_dropna_df)

392

In [60]:
len(mpg_df)

406

In [61]:
mpg_dropna_hp_df = mpg_df.dropna(subset=['horsepower'])

In [62]:
len(mpg_dropna_hp_df)

400
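The two **.dropna()** calls above differ only in which columns are checked for missing values.  A minimal sketch on made-up data (the `demo_df` values are illustrative):

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame({
    'mpg': [18.0, np.nan, 25.0, 30.0],
    'horsepower': [130.0, 115.0, np.nan, 70.0],
})

# Default: drop a row if ANY column is missing.
any_missing_dropped = demo_df.dropna()

# subset: only the listed columns are checked when deciding what to drop.
hp_missing_dropped = demo_df.dropna(subset=['horsepower'])

print(len(demo_df), len(any_missing_dropped), len(hp_missing_dropped))
```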

## Dealing With Duplicates<a name="duplicate"></a>

The **.unique()** method returns a NumPy array of the unique values in the specified column, in order of first appearance.

In [63]:
mpg_df['cylinders'].unique()

array([8, 4, 6, 3, 5], dtype=int64)

The **.nunique()** method returns the number of unique values in the specified column.

In [64]:
mpg_df['cylinders'].nunique()

5

The **.duplicated()** method returns a pd.Series of boolean values denoting whether each value is a duplicate.  By default, the first occurrence of a value is not considered a duplicate.

In [65]:
mpg_df['cylinders']

0      8
1      8
2      8
3      8
4      8
      ..
401    4
402    4
403    4
404    4
405    4
Name: cylinders, Length: 406, dtype: int64

In [66]:
mpg_df['cylinders'].duplicated()

0      False
1       True
2       True
3       True
4       True
       ...  
401     True
402     True
403     True
404     True
405     True
Name: cylinders, Length: 406, dtype: bool

In [67]:
mpg_df['cylinders'].duplicated().sum()

401

The **.drop_duplicates()** method removes rows that are duplicated.  By default, it removes rows that are exact duplicates.  Additionally, by default, the first occurrence is not considered a duplicate and is retained.

In [68]:
mpg_df_drop_dup = mpg_df.drop_duplicates()

In [69]:
len(mpg_df_drop_dup) == len(mpg_df)

True

You can specify which columns to check for duplicates and remove them.

In [70]:
mpg_df_drop_dup_cyl = mpg_df.drop_duplicates('cylinders')
mpg_df_drop_dup_cyl

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name,mpg_fill_0,mpg_fill_mean
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu,18.0,18.0
10,,4,133.0,115.0,3090,17.5,70,2,citroen ds-21 pallas,0.0,23.514573
21,22.0,6,198.0,95.0,2833,15.5,70,1,plymouth duster,22.0,22.0
78,19.0,3,70.0,97.0,2330,13.5,72,3,mazda rx2 coupe,19.0,19.0
281,20.3,5,131.0,103.0,2830,15.9,78,2,audi 5000,20.3,20.3


In [71]:
len(mpg_df_drop_dup_cyl) == len(mpg_df)

False

In [72]:
mpg_df_drop_dup_cyl_origin = mpg_df.drop_duplicates(['cylinders', 'origin'])

In [73]:
mpg_df_drop_dup_cyl_origin

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name,mpg_fill_0,mpg_fill_mean
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu,18.0,18.0
10,,4,133.0,115.0,3090,17.5,70,2,citroen ds-21 pallas,0.0,23.514573
20,24.0,4,113.0,95.0,2372,15.0,70,3,toyota corona mark ii,24.0,24.0
21,22.0,6,198.0,95.0,2833,15.5,70,1,plymouth duster,22.0,22.0
36,28.0,4,140.0,90.0,2264,15.5,71,1,chevrolet vega 2300,28.0,28.0
78,19.0,3,70.0,97.0,2330,13.5,72,3,mazda rx2 coupe,19.0,19.0
130,20.0,6,156.0,122.0,2807,13.5,73,3,toyota mark ii,20.0,20.0
218,16.5,6,168.0,120.0,3820,16.7,76,2,mercedes-benz 280s,16.5,16.5
281,20.3,5,131.0,103.0,2830,15.9,78,2,audi 5000,20.3,20.3


In [74]:
len(mpg_df_drop_dup_cyl_origin) == len(mpg_df)

False
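Which occurrence counts as the "original" is controlled by the **keep** argument, accepted by both **.duplicated()** and **.drop_duplicates()**.  A minimal sketch on a made-up Series:

```python
import pandas as pd

s = pd.Series([8, 8, 4, 8, 4, 6])

# keep='first' (the default): later repeats are flagged.
first_flags = s.duplicated()

# keep='last': earlier repeats are flagged instead.
last_flags = s.duplicated(keep='last')

# keep=False: every value that occurs more than once is flagged.
all_flags = s.duplicated(keep=False)

print(first_flags.tolist())
print(last_flags.tolist())
print(all_flags.tolist())
```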

## Math Operations with DataFrames<a name="math_ops"></a>

In [75]:
mpg_df['weight_standardized'] = (mpg_df['weight'] - mpg_df['weight'].mean()) / mpg_df['weight'].std()

In [76]:
mpg_df[['weight', 'weight_standardized']].head()

Unnamed: 0,weight,weight_standardized
0,3504,0.619343
1,3693,0.842482
2,3436,0.53906
3,3433,0.535518
4,3449,0.554408


## A Word About Indices<a name="index"></a>

In [77]:
df1 = pd.DataFrame({'x': [1, 2, 3]})
df1

Unnamed: 0,x
0,1
1,2
2,3


In [78]:
df2 = pd.DataFrame({'y': [1, 2, 3]})
df2.index = [2, 1, 0]
df2

Unnamed: 0,y
2,1
1,2
0,3


In [79]:
df1['x'] + df2['y']

0    4
1    4
2    4
dtype: int64
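The result above is 4 in every row because pandas aligns values by index label, not by position, before doing arithmetic.  The sketch below repeats the idea with two Series and contrasts it with a positional sum; the variable names are illustrative.

```python
import pandas as pd

x = pd.Series([1, 2, 3])                   # index labels 0, 1, 2
y = pd.Series([1, 2, 3], index=[2, 1, 0])  # same values, reversed labels

# Label-based alignment: label 0 pairs 1 with 3, label 1 pairs 2 with 2, and so on.
aligned_sum = x + y

# Positional addition: discard y's labels first.
positional_sum = x + y.reset_index(drop=True)

print(aligned_sum.tolist())
print(positional_sum.tolist())
```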

## Functions in Python<a name="functions"></a>

Functions are defined using the reserved keyword **def**.  The indented code block tells Python what is included in the function definition.

In [80]:
def subtract_none(a, b):
    a - b

In [81]:
bad_sub = subtract_none(5, 1)

When we display the variable bad_sub, nothing is shown, although we might expect the value 4.  The function body computes a - b but has no return statement, so the result is discarded and the function returns **None** by default.

In [82]:
bad_sub
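The empty output can be checked explicitly.  A function without a return statement returns **None**, which a notebook does not display when it is the last expression of a cell.  The `result` name below is illustrative.

```python
def subtract_none(a, b):
    a - b  # the difference is computed and then discarded


result = subtract_none(5, 1)

# Compare against None explicitly instead of relying on the cell output.
print(result is None)
```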

We'll define the function correctly this time with a return statement.

In [83]:
def subtract(a, b):
    return a - b

In [84]:
good_sub = subtract(5, 1)

In [85]:
good_sub

4

In Python, a function stops executing as soon as it reaches a return statement.  In the following example, we have 2 return statements in our subtract_return2 function.  However, when we execute this function, nothing after **return a - b** is executed.

In [86]:
def subtract_return2(a, b):
    return a - b
    return b - a

In [87]:
subtract_return2(5, 1)

4

Here is a more relevant example where we create our own function to standardize a column.  Note: the scikit-learn (sklearn) package provides standardization out of the box (for example, through its StandardScaler class).

In [88]:
def standardize_col(col_values):
    return (col_values - col_values.mean()) / col_values.std()

In [89]:
standardize_col(mpg_df['weight'])

0      0.619343
1      0.842482
2      0.539060
3      0.535518
4      0.554408
         ...   
401   -0.223628
402   -1.002845
403   -0.808040
404   -0.418432
405   -0.306272
Name: weight, Length: 406, dtype: float64

In [90]:
mpg_df['weight_standardized_from_func'] = standardize_col(mpg_df['weight'])

In [91]:
mpg_df['weight_standardized'] != mpg_df['weight_standardized_from_func']

0      False
1      False
2      False
3      False
4      False
       ...  
401    False
402    False
403    False
404    False
405    False
Length: 406, dtype: bool

In [92]:
(mpg_df['weight_standardized'] != mpg_df['weight_standardized_from_func']).sum()

0
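One caveat when comparing these values with other libraries, sketched here on made-up numbers: the pandas **.std()** method uses the sample standard deviation (ddof=1) by default, whereas NumPy's `np.std` and scikit-learn's StandardScaler use the population version (ddof=0), so standardized values can differ slightly.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])

sample_std = values.std()              # pandas default: ddof=1
population_std = values.std(ddof=0)    # population standard deviation
numpy_std = np.std(values.to_numpy())  # NumPy default: ddof=0

print(sample_std, population_std, numpy_std)
```

The difference shrinks as the number of rows grows, so for a column with several hundred values it is small but not zero.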

It is good practice to add **docstrings** to your functions.  This allows you to call the **help()** function, which was discussed at the beginning of this tutorial, on them.

In [93]:
def standardize_col_docstring(col_values):
    """
    This function standardizes a given pd.Series and returns the standardized values as another pd.Series.
    """
    return (col_values - col_values.mean()) / col_values.std()

In [94]:
help(standardize_col_docstring)

Help on function standardize_col_docstring in module __main__:

standardize_col_docstring(col_values)
    This function standardizes a given pd.Series and returns the standardized values as another pd.Series.



In [95]:
help(standardize_col)

Help on function standardize_col in module __main__:

standardize_col(col_values)



## Groupby<a name="groupby"></a>

In [96]:
mpg_df.groupby('cylinders')['weight'].mean()

cylinders
3    2398.500000
4    2312.685990
5    3103.333333
6    3198.226190
8    4105.194444
Name: weight, dtype: float64

In [97]:
mpg_df.groupby('cylinders')['weight'].apply(standardize_col)

0     -1.350689
1     -0.926067
2     -1.503463
3     -1.510203
4     -1.474256
         ...   
401    1.358938
402   -0.520116
403   -0.050353
404    0.889174
405    1.159644
Name: weight, Length: 406, dtype: float64
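For group-wise calculations that return one value per original row, such as the standardization above, **.transform()** is a commonly used alternative to **.apply()**; its result is aligned to the original index.  A minimal sketch on made-up data (the `demo_df` values and the new column name are illustrative):

```python
import pandas as pd

demo_df = pd.DataFrame({
    'cylinders': [4, 4, 4, 8, 8, 8],
    'weight': [2000, 2200, 2400, 4000, 4300, 4600],
})


def standardize_col(col_values):
    return (col_values - col_values.mean()) / col_values.std()


# Standardize weight within each cylinders group, keeping the original row order.
demo_df['weight_std_by_cyl'] = (
    demo_df.groupby('cylinders')['weight'].transform(standardize_col)
)

print(demo_df)
```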

In [98]:
mpg_df.groupby('cylinders')['weight'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
cylinders,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
3,4.0,2398.5,247.566153,2124.0,2278.5,2375.0,2495.0,2720.0
4,207.0,2312.68599,351.240579,1613.0,2045.5,2234.0,2573.5,3270.0
5,3.0,3103.333333,374.34387,2830.0,2890.0,2950.0,3240.0,3530.0
6,84.0,3198.22619,332.297419,2472.0,2941.25,3201.5,3430.5,3907.0
8,108.0,4105.194444,445.102182,3086.0,3810.0,4137.5,4382.75,5140.0


You can transpose a DataFrame with the **.T** attribute.

In [99]:
mpg_df.groupby('cylinders')['weight'].describe().T

cylinders,3,4,5,6,8
count,4.0,207.0,3.0,84.0,108.0
mean,2398.5,2312.68599,3103.333333,3198.22619,4105.194444
std,247.566153,351.240579,374.34387,332.297419,445.102182
min,2124.0,1613.0,2830.0,2472.0,3086.0
25%,2278.5,2045.5,2890.0,2941.25,3810.0
50%,2375.0,2234.0,2950.0,3201.5,4137.5
75%,2495.0,2573.5,3240.0,3430.5,4382.75
max,2720.0,3270.0,3530.0,3907.0,5140.0


In [100]:
mpg_df.groupby(['cylinders', 'origin']).mean()

cylinders,origin,mpg,displacement,horsepower,weight,acceleration,model_year,mpg_fill_0,mpg_fill_mean,weight_standardized,weight_standardized_from_func
3,3,20.55,72.5,99.25,2398.5,13.25,75.5,20.55,20.55,-0.685845,-0.685845
4,1,27.840278,124.284722,80.956522,2437.166667,16.526389,78.027778,27.840278,27.840278,-0.640194,-0.640194
4,2,28.411111,104.80303,78.90625,2343.318182,16.763636,75.439394,27.119697,28.188541,-0.750995,-0.750995
4,3,31.595652,99.768116,75.57971,2153.492754,16.569565,77.507246,31.595652,31.595652,-0.975108,-0.975108
5,2,27.366667,145.0,82.333333,3103.333333,18.633333,79.0,27.366667,27.366667,0.146303,0.146303
6,1,19.663514,226.283784,99.671233,3213.905405,16.474324,75.635135,19.663514,19.663514,0.276848,0.276848
6,2,20.1,159.75,113.5,3382.5,16.425,78.25,20.1,20.1,0.475896,0.475896
6,3,23.883333,156.666667,115.833333,2882.0,13.55,78.0,23.883333,23.883333,-0.11501,-0.11501
8,1,14.963107,345.203704,158.453704,4105.194444,12.837037,73.722222,14.27037,15.359008,1.329132,1.329132


## Merging<a name="merge"><a/>

Suppose we have a separate DataFrame that contains the grouped weight averages.

In [101]:
mpg_grpby = mpg_df.groupby(['cylinders', 'origin'])['weight'].mean().reset_index()

In [102]:
mpg_grpby = mpg_grpby.rename(columns={'weight': 'grpby_avg_weight'})

In [103]:
mpg_grpby

Unnamed: 0,cylinders,origin,grpby_avg_weight
0,3,3,2398.5
1,4,1,2437.166667
2,4,2,2343.318182
3,4,3,2153.492754
4,5,2,3103.333333
5,6,1,3213.905405
6,6,2,3382.5
7,6,3,2882.0
8,8,1,4105.194444


We would like to merge this DataFrame with our original DataFrame.  This can be done using the **pd.merge()** function.  By default, this function performs an inner join on the columns passed to **on**.  You can pass the **how** argument ('left', 'right', or 'outer') to change the type of join.

In [104]:
mpg_merged = pd.merge(mpg_df, mpg_grpby, on=['cylinders', 'origin'])
mpg_merged.columns

Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'model_year', 'origin', 'car_name', 'mpg_fill_0',
       'mpg_fill_mean', 'weight_standardized', 'weight_standardized_from_func',
       'grpby_avg_weight'],
      dtype='object')
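
One of the arguments that changes the join type is **how**.  The sketch below compares the default inner join with a left join on two small made-up DataFrames (`cars` and `avg_weight` are hypothetical names and values, not part of the Auto MPG data).

```python
import pandas as pd

# Hypothetical toy data, not taken from the Auto MPG dataset
cars = pd.DataFrame({'cylinders': [4, 6, 8], 'mpg': [30.0, 20.0, 15.0]})
avg_weight = pd.DataFrame({'cylinders': [4, 6], 'avg_weight': [2300.0, 3200.0]})

# Inner join (the default): keeps only keys present in both DataFrames
inner = pd.merge(cars, avg_weight, on='cylinders')

# Left join: keeps every row of the left DataFrame; unmatched rows get NaN
left = pd.merge(cars, avg_weight, on='cylinders', how='left')

print(len(inner))  # 2
print(len(left))   # 3
```

In the tutorial's merge every (cylinders, origin) pair in mpg_df also exists in mpg_grpby, so an inner join and a left join would return the same rows there.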

In [105]:
mpg_merged.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name,mpg_fill_0,mpg_fill_mean,weight_standardized,weight_standardized_from_func,grpby_avg_weight
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu,18.0,18.0,0.619343,0.619343,4105.194444
1,15.0,8,350.0,165.0,3693,11.5,70,1,buick skylark 320,15.0,15.0,0.842482,0.842482,4105.194444
2,18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite,18.0,18.0,0.53906,0.53906,4105.194444
3,16.0,8,304.0,150.0,3433,12.0,70,1,amc rebel sst,16.0,16.0,0.535518,0.535518,4105.194444
4,17.0,8,302.0,140.0,3449,10.5,70,1,ford torino,17.0,17.0,0.554408,0.554408,4105.194444


In [106]:
mpg_merged.tail()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name,mpg_fill_0,mpg_fill_mean,weight_standardized,weight_standardized_from_func,grpby_avg_weight
401,16.2,6,163.0,133.0,3410,15.8,78,2,peugeot 604sl,16.2,16.2,0.508364,0.508364,3382.5
402,30.7,6,145.0,76.0,3160,19.6,81,2,volvo diesel,30.7,30.7,0.213206,0.213206,3382.5
403,20.3,5,131.0,103.0,2830,15.9,78,2,audi 5000,20.3,20.3,-0.176403,-0.176403,3103.333333
404,25.4,5,183.0,77.0,3530,20.1,79,2,mercedes benz 300d,25.4,25.4,0.650039,0.650039,3103.333333
405,36.4,5,121.0,67.0,2950,19.9,80,2,audi 5000s (diesel),36.4,36.4,-0.034727,-0.034727,3103.333333


## One-Hot Encoding<a name="onehot"><a/>

In [107]:
mpg_df[['origin_1', 'origin_2', 'origin_3']] = pd.get_dummies(mpg_df['origin'])

In [108]:
mpg_df[['origin', 'origin_1', 'origin_2', 'origin_3']]

Unnamed: 0,origin,origin_1,origin_2,origin_3
0,1,1,0,0
1,1,1,0,0
2,1,1,0,0
3,1,1,0,0
4,1,1,0,0
...,...,...,...,...
401,1,1,0,0
402,2,0,1,0
403,1,1,0,0
404,1,1,0,0


## Mapping<a name="mapping"><a/>

In [109]:
mpg_df['origin_str'] = mpg_df['origin'].map({1: 'US', 2: 'Europe', 3: 'Japan'})

In [110]:
mpg_df[['origin', 'origin_str']]

Unnamed: 0,origin,origin_str
0,1,US
1,1,US
2,1,US
3,1,US
4,1,US
...,...,...
401,1,US
402,2,Europe
403,1,US
404,1,US


## Subsetting Data<a name="subset"><a/>

In [114]:
mpg_df[(mpg_df['cylinders'] > 4) & (mpg_df['mpg'] > 15)]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name,mpg_fill_0,mpg_fill_mean,weight_standardized,weight_standardized_from_func,origin_1,origin_2,origin_3,origin_str
0,18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet chevelle malibu,18.0,18.0,0.619343,0.619343,1,0,0,US
2,18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite,18.0,18.0,0.539060,0.539060,1,0,0,US
3,16.0,8,304.0,150.0,3433,12.0,70,1,amc rebel sst,16.0,16.0,0.535518,0.535518,1,0,0,US
4,17.0,8,302.0,140.0,3449,10.5,70,1,ford torino,17.0,17.0,0.554408,0.554408,1,0,0,US
21,22.0,6,198.0,95.0,2833,15.5,70,1,plymouth duster,22.0,22.0,-0.172861,-0.172861,1,0,0,US
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
373,20.2,6,200.0,88.0,3060,17.1,81,1,ford granada gl,20.2,20.2,0.095143,0.095143,1,0,0,US
374,17.6,6,225.0,85.0,3465,16.6,81,1,chrysler lebaron salon,17.6,17.6,0.573298,0.573298,1,0,0,US
394,25.0,6,181.0,110.0,2945,16.4,82,1,buick century limited,25.0,25.0,-0.040630,-0.040630,1,0,0,US
395,38.0,6,262.0,85.0,3015,17.0,82,1,oldsmobile cutlass ciera (diesel),38.0,38.0,0.042014,0.042014,1,0,0,US


#### .loc Method

In [116]:
mpg_df.loc[(mpg_df['cylinders'] > 4) & (mpg_df['mpg'] > 15), ['mpg', 'cylinders', 'horsepower', 'origin']]

Unnamed: 0,mpg,cylinders,horsepower,origin
0,18.0,8,130.0,1
2,18.0,8,150.0,1
3,16.0,8,150.0,1
4,17.0,8,140.0,1
21,22.0,6,95.0,1
...,...,...,...,...
373,20.2,6,88.0,1
374,17.6,6,85.0,1
394,25.0,6,110.0,1
395,38.0,6,85.0,1


#### .iloc Method

In [111]:
mpg_df.iloc[[0, 2, 3], [1, 2]]

Unnamed: 0,cylinders,displacement
0,8,307.0
2,8,318.0
3,8,304.0


In [112]:
mpg_df.iloc[2:, :-3]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name,mpg_fill_0,mpg_fill_mean,weight_standardized,weight_standardized_from_func,origin_1
2,18.0,8,318.0,150.0,3436,11.0,70,1,plymouth satellite,18.0,18.0,0.539060,0.539060,1
3,16.0,8,304.0,150.0,3433,12.0,70,1,amc rebel sst,16.0,16.0,0.535518,0.535518,1
4,17.0,8,302.0,140.0,3449,10.5,70,1,ford torino,17.0,17.0,0.554408,0.554408,1
5,15.0,8,429.0,198.0,4341,10.0,70,1,ford galaxie 500,15.0,15.0,1.607532,1.607532,1
6,14.0,8,454.0,220.0,4354,9.0,70,1,chevrolet impala,14.0,14.0,1.622880,1.622880,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
401,27.0,4,140.0,86.0,2790,15.6,82,1,ford mustang gl,27.0,27.0,-0.223628,-0.223628,1
402,44.0,4,97.0,52.0,2130,24.6,82,2,vw pickup,44.0,44.0,-1.002845,-1.002845,0
403,32.0,4,135.0,84.0,2295,11.6,82,1,dodge rampage,32.0,32.0,-0.808040,-0.808040,1
404,28.0,4,120.0,79.0,2625,18.6,82,1,ford ranger,28.0,28.0,-0.418432,-0.418432,1


## Concatenating DataFrames<a name="concat"><a/>

In [None]:
mpg_df_cyl_4 = mpg_df[mpg_df['cylinders'] == 4]
len(mpg_df_cyl_4)

In [None]:
mpg_df_cyl_6 = mpg_df[mpg_df['cylinders'] == 6]
len(mpg_df_cyl_6)

In [None]:
mpg_df_cyl_8 = mpg_df[mpg_df['cylinders'] == 8]
len(mpg_df_cyl_8)

In [None]:
mpg_df_concat = pd.concat([mpg_df_cyl_4, mpg_df_cyl_6, mpg_df_cyl_8])
len(mpg_df_concat)

In [None]:
mpg_df_concat['cylinders'].value_counts()

## Lagging Variables<a name="lag"><a/>

In [None]:
mpg_df['mpg_lag1'] = mpg_df['mpg'].shift(1)

In [None]:
mpg_df['mpg_lag2'] = mpg_df['mpg'].shift(2)

In [None]:
mpg_df[['mpg', 'mpg_lag1', 'mpg_lag2']].head(10)

## Rolling Functions<a name="rolling"><a/>

In [None]:
np.random.seed(1234)
stock_prices = pd.DataFrame({
    'day': range(1, 366), 
    'stock': np.round(np.abs(np.random.normal(loc=1000, scale=1000, size=365)), 2)
})

In [None]:
stock_prices.head(20)

In [None]:
stock_prices['rolling_5_sum'] = stock_prices['stock'].rolling(5).sum()

In [None]:
stock_prices['rolling_5_mean'] = stock_prices['stock'].rolling(5).mean()

In [None]:
stock_prices['rolling_5_std'] = stock_prices['stock'].rolling(5).std()

In [None]:
stock_prices['rolling_5_min'] = stock_prices['stock'].rolling(5).min()

In [None]:
stock_prices['rolling_5_max'] = stock_prices['stock'].rolling(5).max()

In [None]:
stock_prices.head(15)

## String Methods<a name="string"><a/>

In [None]:
mpg_df['car_name']

Convert strings to all uppercase using the **.str.upper()** method.

In [None]:
mpg_df['car_name'].str.upper()

Capitalize the first letter of every word using the **.str.title()** method.

In [None]:
mpg_df['car_name'].str.title()

Check whether strings start with a specified substring using the **.str.startswith()** method.  There is also a **.str.endswith()** method that checks the end of each string.  **Note**: Both comparisons are **case-sensitive**.

In [None]:
mpg_df['car_name'].str.startswith('chev')

In [None]:
mpg_df['car_name'].str.startswith('chev').sum()
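
To make the case-sensitivity note concrete, here is a small sketch on a made-up Series (`names` is hypothetical, not the tutorial's car_name column).  Lowercasing first is one common way to match regardless of case.

```python
import pandas as pd

# Hypothetical car names with mixed casing
names = pd.Series(['Chevrolet Impala', 'chevy nova', 'ford pinto'])

# Case-sensitive match: 'Chevrolet Impala' is missed
print(names.str.startswith('chev').sum())              # 1

# Lowercase first to match regardless of case
print(names.str.lower().str.startswith('chev').sum())  # 2

# .str.endswith() works the same way on the end of each string
print(names.str.endswith('pinto').sum())               # 1
```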

The **.str.contains()** method checks whether each string contains a pattern, and it supports regular expressions.  Note: The Python standard library module **re** is dedicated to regular expressions.

In this example, check if a string contains any digit 0-9.

In [None]:
mpg_df['car_name'].str.contains(r'\d').sum()  # raw string keeps the backslash in \d intact
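
The sketch below shows the same digit check on a small made-up Series (`names` is hypothetical), next to the equivalent call from the **re** module.  It also shows `regex=False`, which treats the pattern as literal text.

```python
import re

import pandas as pd

# Hypothetical car names
names = pd.Series(['buick skylark 320', 'ford torino', 'audi 5000'])

# Regular expression: \d matches any digit 0-9
has_digit = names.str.contains(r'\d')
print(has_digit.tolist())  # [True, False, True]

# The same pattern applied with the standard-library re module
with_re = [bool(re.search(r'\d', name)) for name in names]
print(with_re)  # [True, False, True]

# regex=False searches for the literal character '.' instead of "any character"
has_dot = names.str.contains('.', regex=False)
print(has_dot.tolist())  # [False, False, False]
```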

Replace characters with the **.str.replace()** method.

In [None]:
mpg_df['car_name'].str.replace('c', 'T')

The **.str.strip()** method removes the leading and trailing characters specified by the user.  There are also the **.str.lstrip()** method, which removes leading characters only, and the **.str.rstrip()** method, which removes trailing characters only.

In [None]:
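mpg_df['disp_as_str'] = mpg_df['displacement'].astype(str)  # string copy of displacement, also created under Data Conversion
mpg_df['disp_as_str'].str.strip('0')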
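
A small sketch on a made-up Series (`s` is hypothetical) compares the three methods.  Note that the argument is treated as a set of characters to remove, not as an exact prefix or suffix.

```python
import pandas as pd

# Hypothetical numeric strings
s = pd.Series(['0070.0', '100.00', '0.50'])

print(s.str.strip('0').tolist())    # ['70.', '100.', '.5']
print(s.str.lstrip('0').tolist())   # ['70.0', '100.00', '.50']
print(s.str.rstrip('0').tolist())   # ['0070.', '100.', '0.5']

# '.0' means "any trailing '.' or '0' characters", so more than the
# literal suffix '.0' can be removed
print(s.str.rstrip('.0').tolist())  # ['007', '1', '0.5']
```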

The **.str.zfill()** method comes in handy for string columns that hold fixed-width identifiers, such as account numbers.  In the example below, let's make a "pretend" account column and suppose that it needs to have leading 0's.

In [None]:
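# rstrip('.0') would strip any trailing '.' or '0' characters ('350.0' -> '35'), so remove the literal '.0' suffix instead
mpg_df['fake_acct_str'] = mpg_df['disp_as_str'].str.replace(r'\.0$', '', regex=True)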
mpg_df['fake_acct_str']

In the example below, the 9 is the target length of the string.  Strings shorter than 9 characters are left-padded with 0's until they are 9 characters long.  Strings of length 9 or more are left unchanged.

In [None]:
mpg_df['fake_acct_str'].str.zfill(9)
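
The padding rule can be checked on a small made-up Series (`accts` is a hypothetical stand-in for the pretend account column):

```python
import pandas as pd

# Hypothetical account strings of length 3, 5, and 10
accts = pd.Series(['307', '12345', '1234567890'])

padded = accts.str.zfill(9)
print(padded.tolist())  # ['000000307', '000012345', '1234567890']
```

The first two strings are padded to 9 characters, while the 10-character string is returned as-is.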

## Data Conversion<a name="data_conversion"><a/>

In [None]:
mpg_df.info()

We see that the displacement column is stored as float.  Let's convert it to integer using the **.astype()** method.

In [None]:
mpg_df['disp_as_int'] = mpg_df['displacement'].astype(int)

In [None]:
mpg_df.info()

Note: You can also call the **.dtype** attribute to check how the column is stored.  Notice that there are no parentheses after **.dtype**; this is because we are accessing the attribute and not calling a method.

In [None]:
mpg_df['displacement'].dtype

In [None]:
mpg_df['disp_as_int'].dtype
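
One detail worth knowing about **.astype(int)** is that it drops the fractional part instead of rounding.  The sketch below uses a made-up Series (`disp` and its values are hypothetical) to show the difference.

```python
import pandas as pd

# Hypothetical float values, two of them with a fractional part
disp = pd.Series([97.5, 307.0, 350.9])

# astype(int) truncates toward zero
as_int = disp.astype(int)
print(as_int.tolist())  # [97, 307, 350]

# Round first if the nearest integer is wanted
# (.round() rounds exact halves to the nearest even integer)
rounded = disp.round().astype(int)
print(rounded.tolist())  # [98, 307, 351]

# dtype is an attribute, so no parentheses
print(disp.dtype)  # float64
```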

In [None]:
mpg_df['disp_as_str'] = mpg_df['displacement'].astype(str)

In [None]:
mpg_df['disp_as_str']

## Exporting DataFrames<a name="export"><a/>

Just as pandas has many methods for reading in files, it also has several methods to export DataFrames.

In [None]:
mpg_df.to_csv(f'c:/users/{os.getlogin()}/mpg_df_csv.csv')

In [None]:
mpg_df.to_excel(f'c:/users/{os.getlogin()}/mpg_df_xl.xlsx', index=False)
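
The two export calls above differ in whether the index is written.  The sketch below shows the effect of `index=False` on a small made-up DataFrame (`df` is hypothetical).  It writes to an in-memory string instead of a file, which is an assumption made only to keep the example self-contained.

```python
import io

import pandas as pd

# Hypothetical two-row DataFrame
df = pd.DataFrame({'mpg': [18.0, 15.0], 'cylinders': [8, 8]})

# With no path, to_csv() returns the CSV text as a string.
# By default the index is written as an extra, unnamed first column.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # ,mpg,cylinders
print(without_index.splitlines()[0])  # mpg,cylinders

# Reading the index-free text back gives the original columns
round_trip = pd.read_csv(io.StringIO(without_index))
print(list(round_trip.columns))  # ['mpg', 'cylinders']
```

Reading back a file that was written with the index produces an extra column named "Unnamed: 0", unless `index_col=0` is passed to **pd.read_csv()**.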