<a href="https://colab.research.google.com/github/victorviro/Machine-Learning-Python/blob/master/Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas

[Pandas](https://github.com/pandas-dev/pandas) is a specialized Python library for data analysis, especially on large tabular datasets. It provides easy-to-use functionality for reading and writing data, dealing with missing data, reshaping the dataset, and massaging the data by slicing, indexing, inserting, and deleting data variables and records. Pandas also has an important **groupby** functionality for aggregating data by defined conditions, useful for plotting and computing data summaries for exploration.

In short, Pandas is the go-to tool for data cleaning and data exploration.
To use Pandas, first import the Pandas module:

In [None]:
import pandas as pd

## Pandas Data Structures

Just like NumPy, Pandas can store and manipulate a multi-dimensional array of data. To handle this, Pandas has the **Series** and **DataFrame** data structures.


### Series



The **Series** data structure is for storing a 1-D array (or vector) of data elements. A Series also provides labels for the data items in the form of an **index**. The user can specify these labels via the **index** parameter of the **Series** constructor; if the **index** parameter is left unspecified, default integer labels from 0 to n - 1 are assigned, where n is the number of data elements.
Let us consider an example of creating a **Series** data structure.

In [None]:
# Create a Series object
my_series = pd.Series([2,4,6,8], index=['e1','e2','e3','e4'])
# Print out data in Series data structure
print(my_series)

# Check the data type of the variable
print(type(my_series))

# Return the elements of the Series data structure
print(my_series.values) 

# Retrieve elements from Series data structure based on their assigned indices
print(my_series['e1'])

# Return all indices of the Series data structure
print(my_series.index)

e1    2
e2    4
e3    6
e4    8
dtype: int64
<class 'pandas.core.series.Series'>
[2 4 6 8]
2
Index(['e1', 'e2', 'e3', 'e4'], dtype='object')


Elements in a Series data structure can be assigned the same indices.

In [None]:
# Create a Series object with elements sharing indices
my_series = pd.Series([2,4,6,8], index=['e1','e2','e1','e2'])

# Note the same index assigned to various elements
print(my_series)

# Get elements using their index
print(my_series['e1'])

e1    2
e2    4
e1    6
e2    8
dtype: int64
e1    2
e1    6
dtype: int64
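
Because repeated labels change what a lookup returns, it can help to check whether an index is unique. The following is a minimal sketch; the variable name `dup_series` is ours, chosen for illustration.

```python
import pandas as pd

dup_series = pd.Series([2, 4, 6, 8], index=['e1', 'e2', 'e1', 'e2'])

# False here, because 'e1' and 'e2' each appear twice
print(dup_series.index.is_unique)

# A repeated label returns a Series of all matching elements
matches = dup_series['e1']
print(type(matches))
```

With a unique index, the same lookup would return a single scalar value instead of a Series.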


### DataFrames



A DataFrame is a Pandas data structure for storing and manipulating 2-D arrays. A 2-D array is a table-like structure that is similar to an Excel spreadsheet or a relational database table. A DataFrame is a very natural form for storing structured datasets.

A DataFrame consists of rows and columns for storing records of information (in rows) across heterogeneous variables (in columns).

Let’s see examples of working with DataFrames.

In [None]:
# Create a data frame
my_DF = pd.DataFrame({'age': [15,17,21,29,25], \
            'state_of_origin':['Lagos', 'Cross River', 'Kano', 'Abia', 'Benue']})
my_DF 

Unnamed: 0,age,state_of_origin
0,15,Lagos
1,17,Cross River
2,21,Kano
3,29,Abia
4,25,Benue


We observe from the preceding example that a DataFrame can be constructed from a dictionary in which each key becomes a column name and each value becomes a column stored as a **Series**. Also note that each row has an **index** that can be assigned when creating the DataFrame; otherwise the default labels from 0 to one less than the number of records are used. Creating an index manually is usually not feasible except when working with small dummy datasets.

NumPy is frequently used together with Pandas. Let’s import the NumPy library and use some of its functions to demonstrate other ways of creating a quick DataFrame.

In [None]:
import numpy as np

# Create a 3x3 dataframe of numbers from the normal distribution
my_DF = pd.DataFrame(np.random.randn(3,3),\
            columns=['First','Second','Third'])
print(my_DF)

# Check the dimensions
print(my_DF.shape)

      First    Second     Third
0 -0.530889  0.202075 -0.202340
1 -0.983681 -0.156605  1.208290
2 -0.271540 -0.927292  0.861979
(3, 3)


Let’s examine some other operations with DataFrames.

In [None]:
# Create a python dictionary
my_dict = {'State':['Adamawa', 'Akwa-Ibom', 'Yobe', 'Rivers', 'Taraba'], \
            'Capital':['Yola','Uyo','Damaturu','Port-Harcourt','Jalingo'], \
            'Population':[3178950, 5450758, 2321339, 5198716, 2294800]}
print(my_dict)

# Confirm dictionary type
print(type(my_dict))

# Create DataFrame from dictionary
my_DF = pd.DataFrame(my_dict)
print(my_DF) 

# Check DataFrame type
print(type(my_DF))

# Retrieve column names of the DataFrame
print(my_DF.columns)

# The `DF.columns` attribute is an Index object
print(type(my_DF.columns))

# Retrieve the DataFrame values as a NumPy ndarray
print(my_DF.values) 

# The `DF.values` attribute is a NumPy ndarray
print(type(my_DF.values)) 

{'State': ['Adamawa', 'Akwa-Ibom', 'Yobe', 'Rivers', 'Taraba'], 'Capital': ['Yola', 'Uyo', 'Damaturu', 'Port-Harcourt', 'Jalingo'], 'Population': [3178950, 5450758, 2321339, 5198716, 2294800]}
<class 'dict'>
       State        Capital  Population
0    Adamawa           Yola     3178950
1  Akwa-Ibom            Uyo     5450758
2       Yobe       Damaturu     2321339
3     Rivers  Port-Harcourt     5198716
4     Taraba        Jalingo     2294800
<class 'pandas.core.frame.DataFrame'>
Index(['State', 'Capital', 'Population'], dtype='object')
<class 'pandas.core.indexes.base.Index'>
[['Adamawa' 'Yola' 3178950]
 ['Akwa-Ibom' 'Uyo' 5450758]
 ['Yobe' 'Damaturu' 2321339]
 ['Rivers' 'Port-Harcourt' 5198716]
 ['Taraba' 'Jalingo' 2294800]]
<class 'numpy.ndarray'>


In summary, a DataFrame is a tabular structure for storing a structured dataset where each column contains a **Series** data structure of records.


Let’s check the data type of each column in the DataFrame.

In [None]:
my_DF.dtypes

State         object
Capital       object
Population     int64
dtype: object

Here, the **object** data type indicates columns that hold Python **strings**. More generally, an **object** column can hold any Python object.
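
As a small illustration of the **object** data type, the sketch below builds a Series that mixes a string, an integer, and `None`. The variable name `mixed` is ours.

```python
import pandas as pd

# A Series holding values of different Python types
mixed = pd.Series(['Yola', 3178950, None])

# Pandas falls back to the generic object dtype
print(mixed.dtype)

# Each element keeps its own Python type
print([type(v).__name__ for v in mixed])
```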

## Data Indexing (Selection/Subsets)



Similar to NumPy, Pandas objects can be indexed to retrieve a specific subset of the larger dataset. Indexing returns a **DataFrame** when the result is 2-D and a **Series** when it is 1-D, and it does not alter the original dataset. Let’s go through some examples of indexing a Pandas DataFrame.

First let’s create a dataframe. Observe the default integer indices assigned.

In [None]:
# Create the dataframe
my_DF = pd.DataFrame({'age': [15,17,21,29,25], \
            'state_of_origin':['Lagos', 'Cross River', 'Kano', 'Abia', 'Benue']})
my_DF 

Unnamed: 0,age,state_of_origin
0,15,Lagos
1,17,Cross River
2,21,Kano
3,29,Abia
4,25,Benue


### Selecting a Column from a DataFrame

Remember that the data type of a DataFrame column is a **Series** because it is a vector or 1-D array.

In [None]:
print(my_DF['age'])

# Check data type
print(type(my_DF['age']))

0    15
1    17
2    21
3    29
4    25
Name: age, dtype: int64
<class 'pandas.core.series.Series'>


To select multiple columns, pass a list of column names (as **strings**) inside the indexing brackets, which results in double square brackets `[[ ]]`. The following code is an example:

In [None]:
my_DF[['age','state_of_origin']]

Unnamed: 0,age,state_of_origin
0,15,Lagos
1,17,Cross River
2,21,Kano
3,29,Abia
4,25,Benue


### Selecting a Row from a DataFrame

Pandas makes use of two indexer attributes for selecting rows from a **DataFrame** or cells from a **Series**: `iloc` and `loc`. The `iloc` indexer selects or slices row(s) by integer position, using the usual zero-based Python indexing, whereas the `loc` indexer uses the explicit index labels assigned to the DataFrame. If no explicit index is assigned, the default labels are the integers 0, 1, 2, and so on, so `loc[0]` and `iloc[0]` return the same row.

Remember that the data type of a DataFrame row is a **Series** because it is a vector or 1-D array.

Let’s select the first row from the DataFrame.

In [None]:
# Using explicit indexing
print(my_DF.loc[0])

# Using implicit indexing
print(my_DF.iloc[0])

# Let's see the data type
print(type(my_DF.loc[0])) 

age                   15
state_of_origin    Lagos
Name: 0, dtype: object
age                   15
state_of_origin    Lagos
Name: 0, dtype: object
<class 'pandas.core.series.Series'>
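
Even with default integer labels, `loc` and `iloc` differ when slicing. The sketch below, using a small frame named `people` for illustration, shows that `iloc` excludes the stop position while `loc` includes the stop label.

```python
import pandas as pd

people = pd.DataFrame({'age': [15, 17, 21, 29, 25]})

# iloc follows Python slicing: positions 0 and 1, stop excluded
by_position = people.iloc[0:2]
print(by_position)

# loc slices by label and includes the stop label: labels 0, 1 and 2
by_label = people.loc[0:2]
print(by_label)
```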


Now let’s create a DataFrame with explicit indexing and test out the `iloc` and `loc` indexers. Pandas will raise an error if `iloc` is given a label such as `'a'`, or if `loc` is given a position such as `0` that is not one of the assigned labels.

In [None]:
my_DF = pd.DataFrame({'age': [15,17,21,29,25], \
            'state_of_origin':['Lagos', 'Cross River', 'Kano', 'Abia', 'Benue']},\
            index=['a','a','b','b','c'])
 
# Observe the string indices
print(my_DF) 

# Select using explicit indexing
print(my_DF.loc['a'])

# Let's try to use loc for implicit indexing
#print(my_DF.loc[0])

   age state_of_origin
a   15           Lagos
a   17     Cross River
b   21            Kano
b   29            Abia
c   25           Benue
   age state_of_origin
a   15           Lagos
a   17     Cross River
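
The commented-out line above can be tried safely by catching the error. This is a minimal sketch on a smaller frame named `labelled` (our name), assuming string labels as in the example.

```python
import pandas as pd

labelled = pd.DataFrame({'age': [15, 17, 21]}, index=['a', 'a', 'b'])

# Positional access still works when the labels are strings
first_row = labelled.iloc[0]
print(first_row)

# The label 0 does not exist in this index, so loc raises a KeyError
try:
    labelled.loc[0]
    raised = False
except KeyError:
    raised = True
print(raised)
```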


### Selecting Multiple Rows and Columns from a DataFrame

Let’s use the `loc` indexer to select multiple rows and columns from a Pandas DataFrame.

In [None]:
# Select rows with age greater than 20
my_DF.loc[my_DF.age > 20]

Unnamed: 0,age,state_of_origin
b,21,Kano
b,29,Abia
c,25,Benue


In [None]:
# Find states of origin with age greater than or equal to 25
my_DF.loc[my_DF.age >= 25, 'state_of_origin'] 

b     Abia
c    Benue
Name: state_of_origin, dtype: object
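
Conditions can also be combined. The sketch below uses `&` for "and"; the frame `people` and the result name `twenties` are ours, for illustration only.

```python
import pandas as pd

people = pd.DataFrame({'age': [15, 17, 21, 29, 25],
                       'state_of_origin': ['Lagos', 'Cross River', 'Kano', 'Abia', 'Benue']})

# Each condition needs its own parentheses; & means "and", | means "or"
twenties = people.loc[(people['age'] > 20) & (people['age'] < 29), 'state_of_origin']
print(twenties)
```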

### Slice Cells by Row and Column from a DataFrame


First let’s create a DataFrame. Remember, `iloc` selects by integer position, so it is the natural choice when no explicit index or row labels are assigned.

In [None]:
my_DF = pd.DataFrame({'age': [15,17,21,29,25], \
            'state_of_origin':['Lagos', 'Cross River', 'Kano', 'Abia', 'Benue']})
my_DF

Unnamed: 0,age,state_of_origin
0,15,Lagos
1,17,Cross River
2,21,Kano
3,29,Abia
4,25,Benue


In [None]:
# Select the third row and second column
my_DF.iloc[2,1]

'Kano'

In [None]:
# Slice the first 2 rows - indexed from zero, excluding the final index
my_DF.iloc[:2,]

Unnamed: 0,age,state_of_origin
0,15,Lagos
1,17,Cross River


In [None]:
# Slice the last three rows from the last column
my_DF.iloc[-3:,-1] 

2     Kano
3     Abia
4    Benue
Name: state_of_origin, dtype: object

## DataFrame Manipulation

Let’s go through some common tasks for manipulating a DataFrame.

### Removing a Row/Column

In many cases during the data cleaning process, there may be a need to drop unwanted rows or data variables (i.e., columns). We typically do this using the `drop` method. The `drop` method has a parameter `axis` whose default is 0. If `axis` is set to 1, columns are dropped from the dataset, but if left at the default, rows are dropped.
Note that when a column or row is dropped, a new DataFrame or Series is returned without altering the original data structure. However, when the parameter `inplace` is set to `True`, the original DataFrame or Series is modified and the call returns `None`. Let’s see some examples.

In [None]:
my_DF = pd.DataFrame({'age': [15,17,21,29,25], \
            'state_of_origin':['Lagos', 'Cross River', 'Kano', 'Abia', 'Benue']})
print(my_DF)

# Drop the rows labelled 2 and 4 (the 3rd and 5th rows)
print(my_DF.drop([2,4])) 

# Drop the `age` column
print(my_DF.drop('age', axis=1))

# The original DataFrame is unchanged
print(my_DF )

# Drop using 'inplace' - modifies the original DataFrame and returns None, so this prints None
print(my_DF.drop('age', axis=1, inplace=True))

# The original DataFrame altered
print(my_DF)

   age state_of_origin
0   15           Lagos
1   17     Cross River
2   21            Kano
3   29            Abia
4   25           Benue
   age state_of_origin
0   15           Lagos
1   17     Cross River
3   29            Abia
  state_of_origin
0           Lagos
1     Cross River
2            Kano
3            Abia
4           Benue
   age state_of_origin
0   15           Lagos
1   17     Cross River
2   21            Kano
3   29            Abia
4   25           Benue
None
  state_of_origin
0           Lagos
1     Cross River
2            Kano
3            Abia
4           Benue
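
As an alternative to the `axis` parameter, `drop` also accepts the `columns` and `index` keywords, which some readers find more explicit. A minimal sketch, with variable names of our choosing:

```python
import pandas as pd

people = pd.DataFrame({'age': [15, 17, 21, 29, 25],
                       'state_of_origin': ['Lagos', 'Cross River', 'Kano', 'Abia', 'Benue']})

# Same effect as drop('age', axis=1)
no_age = people.drop(columns='age')

# Same effect as drop([2, 4]): removes the rows labelled 2 and 4
fewer_rows = people.drop(index=[2, 4])

print(no_age.columns.tolist())
print(fewer_rows.index.tolist())
```

Neither call uses `inplace`, so `people` itself is left unchanged.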


Let’s see examples of removing a row given a condition.

In [None]:
my_DF = pd.DataFrame({'age': [15,17,21,29,25], \
            'state_of_origin':['Lagos', 'Cross River', 'Kano', 'Abia', 'Benue']})
print(my_DF)

# Drop all rows where age is less than 20
my_DF.drop(my_DF[my_DF['age'] < 20].index, inplace=True)
my_DF 

   age state_of_origin
0   15           Lagos
1   17     Cross River
2   21            Kano
3   29            Abia
4   25           Benue


Unnamed: 0,age,state_of_origin
2,21,Kano
3,29,Abia
4,25,Benue
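
The same result can usually be obtained by keeping the rows we want with a boolean mask rather than dropping the ones we do not. This sketch assumes the same example data; `adults` is a name we introduce.

```python
import pandas as pd

people = pd.DataFrame({'age': [15, 17, 21, 29, 25],
                       'state_of_origin': ['Lagos', 'Cross River', 'Kano', 'Abia', 'Benue']})

# Keep rows where age is at least 20; the original frame is not modified
adults = people[people['age'] >= 20]
print(adults)
```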


### Adding a Row/Column

We can add a new column to a Pandas DataFrame by using the assign method.

In [None]:
# show dataframe
my_DF = pd.DataFrame({'age': [15,17,21,29,25], \
            'state_of_origin':['Lagos', 'Cross River', 'Kano', 'Abia', 'Benue']})
print(my_DF)

# Add column to data frame
my_DF = my_DF.assign(capital_city = pd.Series(['Ikeja', 'Calabar', \
                                                'Kano', 'Umuahia', 'Makurdi']))
my_DF 

   age state_of_origin
0   15           Lagos
1   17     Cross River
2   21            Kano
3   29            Abia
4   25           Benue


Unnamed: 0,age,state_of_origin,capital_city
0,15,Lagos,Ikeja
1,17,Cross River,Calabar
2,21,Kano,Kano
3,29,Abia,Umuahia
4,25,Benue,Makurdi


We can also add a new DataFrame column by computing some function on another column. Let’s take an example by adding a column computing the absolute difference of the ages from their mean.


In [None]:
mean_of_age = my_DF['age'].mean()
my_DF['diff_age'] = my_DF['age'].map( lambda x: abs(x-mean_of_age))
my_DF 

Unnamed: 0,age,state_of_origin,capital_city,diff_age
0,15,Lagos,Ikeja,6.4
1,17,Cross River,Calabar,4.4
2,21,Kano,Kano,0.4
3,29,Abia,Umuahia,7.6
4,25,Benue,Makurdi,3.6
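
For simple arithmetic like this, a vectorized expression is a common alternative to `map` with a lambda. The sketch below computes the same column on a small frame named `people` (our name).

```python
import pandas as pd

people = pd.DataFrame({'age': [15, 17, 21, 29, 25]})

# Subtract the mean from the whole column at once, then take absolute values
people['diff_age'] = (people['age'] - people['age'].mean()).abs()
print(people)
```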


Typically in practice, a fully formed dataset is loaded into Pandas for cleaning and data analysis, which does not usually involve adding a new observation to the dataset. In the event that this is desired, we can use the `pd.concat()` function. (Older tutorials use the `DataFrame.append()` method, which was deprecated in Pandas 1.4 and removed in Pandas 2.0.) Adding rows one at a time copies the data on every call, so it may not be a computationally efficient action. Let’s see an example.

In [None]:
my_DF = pd.DataFrame({'age': [15,17,21,29,25], \
            'state_of_origin':['Lagos', 'Cross River', 'Kano', 'Abia', 'Benue']})
print(my_DF )

# Build the new row as a one-row DataFrame with the same columns
new_row = pd.DataFrame([[30 , 'Osun']], columns=my_DF.columns)

# Add the row to the data frame
my_DF = pd.concat([my_DF, new_row], ignore_index=True)
my_DF 

   age state_of_origin
0   15           Lagos
1   17     Cross River
2   21            Kano
3   29            Abia
4   25           Benue


Unnamed: 0,age,state_of_origin
0,15,Lagos
1,17,Cross River
2,21,Kano
3,29,Abia
4,25,Benue
5,30,Osun


We observe that adding a new row involves passing to `pd.concat` a one-row DataFrame whose columns are set to the columns of the main DataFrame. Since typically, in given datasets, the index is nothing more than the assigned defaults, we set the parameter `ignore_index` to `True` to create a new set of default index values with the new row(s).

### Data Alignment

Pandas utilizes data alignment to align indices when performing some binary arithmetic operation on DataFrames. If two or more DataFrames in an arithmetic operation do not share a common index, a `NaN` is introduced denoting missing data. Let’s see examples of this.

In [None]:
# Create a 3x3 dataframe - remember randint(low, high, size)
df_A = pd.DataFrame(np.random.randint(1,10,[3,3]),\
            columns=['First','Second','Third'])
print(df_A) 

# Create a 4x3 dataframe
df_B = pd.DataFrame(np.random.randint(1,10,[4,3]),\
            columns=['First','Second','Third'])
print(df_B) 

# Add df_A and df_B together
print(df_A + df_B) 

# Divide both dataframes
print(df_A / df_B)

   First  Second  Third
0      8       1      1
1      5       8      7
2      5       9      6
   First  Second  Third
0      9       8      9
1      6       9      2
2      4       5      7
3      5       9      7
   First  Second  Third
0   17.0     9.0   10.0
1   11.0    17.0    9.0
2    9.0    14.0   13.0
3    NaN     NaN    NaN
      First    Second     Third
0  0.888889  0.125000  0.111111
1  0.833333  0.888889  3.500000
2  1.250000  1.800000  0.857143
3       NaN       NaN       NaN


If we do not want `NaN` to appear where the indices do not match, we can use the `fill_value` parameter to substitute a default value for the missing entries before the operation is carried out. However, to take advantage of `fill_value`, we have to use the Pandas arithmetic methods: `add()`, `sub()`, `mul()`, `div()`, `floordiv()`, `mod()`, and `pow()` for addition, subtraction, multiplication, numeric division, integer division, remainder division, and exponentiation. Let’s see examples.

In [None]:
df_A.add(df_B, fill_value=10)

Unnamed: 0,First,Second,Third
0,17.0,9.0,10.0
1,11.0,17.0,9.0
2,9.0,14.0,13.0
3,15.0,19.0,17.0
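
To show the difference between numeric and integer division with `fill_value`, here is a small sketch on fixed data. The frames `small` and `large` are made up for illustration.

```python
import pandas as pd

small = pd.DataFrame({'First': [8, 5]})
large = pd.DataFrame({'First': [4, 2, 5]})

# div() is numeric division; the missing row of `small` is treated as 1
true_div = small.div(large, fill_value=1)
print(true_div)

# floordiv() is integer division, discarding the fractional part
floor_div = small.floordiv(large, fill_value=1)
print(floor_div)
```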


### Combining Datasets

We may need to combine two or more datasets together; Pandas provides functions for such operations. We will consider the simple case of combining data frames with shared column names using the `concat` function.

In [None]:
# Combine two dataframes row-wise (stacked vertically)
pd.concat([df_A, df_B]) 

Unnamed: 0,First,Second,Third
0,8,1,1
1,5,8,7
2,5,9,6
0,9,8,9
1,6,9,2
2,4,5,7
3,5,9,7


Observe that the `concat` function preserves the original indices by default. We can also concatenate two dataframes by columns (that is, horizontally, side by side). This is done by setting the `axis` parameter to 1.

In [None]:
# Combine two dataframes horizontally
pd.concat([df_A, df_B], axis=1) 

Unnamed: 0,First,Second,Third,First.1,Second.1,Third.1
0,8.0,1.0,1.0,9,8,9
1,5.0,8.0,7.0,6,9,2
2,5.0,9.0,6.0,4,5,7
3,,,,5,9,7
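
Returning to the row-wise case, where the original indices are repeated in the result: if a fresh default index is preferred, `concat` accepts `ignore_index=True`. A minimal sketch with two small made-up frames, `top` and `bottom`:

```python
import pandas as pd

top = pd.DataFrame({'First': [8, 5]})
bottom = pd.DataFrame({'First': [9, 6, 4]})

# Default behaviour: the original labels are kept, so 0 and 1 repeat
kept = pd.concat([top, bottom])
print(kept.index.tolist())

# ignore_index=True renumbers the rows from 0
stacked = pd.concat([top, bottom], ignore_index=True)
print(stacked.index.tolist())
```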


## Handling Missing Data

Dealing with missing data is an integral part of the data cleaning/data analysis process. Moreover, some machine learning algorithms will not work in the presence of missing data. Let’s see some simple Pandas methods for identifying and removing missing data, as well as imputing values into missing data.

### Identifying Missing Data

In this section, we’ll use the `isnull()` method to check if missing cells exist in a DataFrame.

In [None]:
# Let's create a data frame with missing data
my_DF = pd.DataFrame({'age': [15,17,np.nan,29,25], \
            'state_of_origin':['Lagos', 'Cross River', 'Kano', 'Abia', np.nan]})
my_DF 

Unnamed: 0,age,state_of_origin
0,15.0,Lagos
1,17.0,Cross River
2,,Kano
3,29.0,Abia
4,25.0,


Let’s check for missing data in this data frame. The `isnull()` method returns `True` where a value is missing and `False` elsewhere, whereas the `notnull()` method does the opposite, returning `False` where a value is missing.

In [None]:
my_DF.isnull() 

Unnamed: 0,age,state_of_origin
0,False,False
1,False,False
2,True,False
3,False,False
4,False,True


However, if we want a single answer (i.e., either `True` or `False`) to report whether there is any missing data in the data frame, we can first convert the boolean DataFrame to a NumPy array with `.values` and then call `any()` on it.

The `any` function returns `True` when at least one of the elements in the dataset is `True`. In this case, `isnull()` returns a DataFrame of booleans where `True` designates a cell with a missing value.
Let’s see how that works.

In [None]:
my_DF.isnull().values.any()

True
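
Beyond a yes/no answer, it is often useful to count the missing cells. Summing the boolean result of `isnull()` is one way to do this; the names below are ours.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({'age': [15, 17, np.nan, 29, 25],
                       'state_of_origin': ['Lagos', 'Cross River', 'Kano', 'Abia', np.nan]})

# True counts as 1, so the column sums give missing cells per column
missing_per_column = people.isnull().sum()
print(missing_per_column)

# Summing again gives the total number of missing cells
total_missing = int(missing_per_column.sum())
print(total_missing)
```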

### Removing Missing Data

Pandas has a function `dropna()` which is used to filter or remove missing data from a DataFrame. `dropna()` returns a new DataFrame without missing data. Let’s see examples of how this works.

In [None]:
# Let's see our dataframe with missing data
my_DF = pd.DataFrame({'age': [15,17,np.nan,29,25], \
            'state_of_origin':['Lagos', 'Cross River', 'Kano', 'Abia', np.nan]})
print(my_DF)

# Let's run dropna() to remove all rows with missing values
my_DF.dropna() 

    age state_of_origin
0  15.0           Lagos
1  17.0     Cross River
2   NaN            Kano
3  29.0            Abia
4  25.0             NaN


Unnamed: 0,age,state_of_origin
0,15.0,Lagos
1,17.0,Cross River
3,29.0,Abia


As we observe from the preceding code block, `dropna()` drops all rows that contain a missing value. But we may not want that. We may rather want, for example, to drop columns with missing data, to drop only rows where all the observations are missing, or to drop rows depending on the number of observations present in each row.

Let’s see examples of this option. First let’s expand our example dataset.

In [None]:
my_DF = pd.DataFrame({
    'Capital': ['Yola', np.nan, np.nan, 'Port-Harcourt', 'Jalingo'],
    'Population': [3178950, np.nan, 2321339, np.nan, 2294800],
    'State': ['Adamawa', np.nan, 'Yobe', np.nan, 'Taraba'],
    'LGAs': [22, np.nan, 17, 23, 16]})
my_DF 

Unnamed: 0,Capital,Population,State,LGAs
0,Yola,3178950.0,Adamawa,22.0
1,,,,
2,,2321339.0,Yobe,17.0
3,Port-Harcourt,,,23.0
4,Jalingo,2294800.0,Taraba,16.0


Drop columns with `NaN`. This option is not often used in practice.

In [None]:
my_DF.dropna(axis=1)

0
1
2
3
4


Drop rows where all the observations are missing.

In [None]:
my_DF.dropna(how='all')

Unnamed: 0,Capital,Population,State,LGAs
0,Yola,3178950.0,Adamawa,22.0
2,,2321339.0,Yobe,17.0
3,Port-Harcourt,,,23.0
4,Jalingo,2294800.0,Taraba,16.0


Drop rows based on an observation threshold. By adjusting the `thresh` parameter, we can drop rows where the number of non-missing observations in the row is less than the `thresh` value.

In [None]:
# Keep only rows with at least 3 non-missing values
my_DF.dropna(thresh=3) 

Unnamed: 0,Capital,Population,State,LGAs
0,Yola,3178950.0,Adamawa,22.0
2,,2321339.0,Yobe,17.0
4,Jalingo,2294800.0,Taraba,16.0
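
To see why those rows survive, we can count the non-missing values in each row and compare against the threshold. This sketch rebuilds the example data under the name `states` (our name).

```python
import numpy as np
import pandas as pd

states = pd.DataFrame({
    'Capital': ['Yola', np.nan, np.nan, 'Port-Harcourt', 'Jalingo'],
    'Population': [3178950, np.nan, 2321339, np.nan, 2294800],
    'State': ['Adamawa', np.nan, 'Yobe', np.nan, 'Taraba'],
    'LGAs': [22, np.nan, 17, 23, 16]})

# Number of non-missing values in each row
non_missing = states.notnull().sum(axis=1)
print(non_missing)

# thresh=3 keeps exactly the rows with 3 or more non-missing values
kept = states.dropna(thresh=3)
print(kept.index.tolist())
```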


### Imputing Values into Missing Data

Imputing values as substitutes for missing data is a standard practice in preparing data for machine learning. Pandas has a `fillna()` function for this purpose. A simple approach is to fill `NaNs` with zeros.

In [None]:
my_DF.fillna(0) # we can also run my_DF.replace(np.nan, 0)

Unnamed: 0,Capital,Population,State,LGAs
0,Yola,3178950.0,Adamawa,22.0
1,0,0.0,0,0.0
2,0,2321339.0,Yobe,17.0
3,Port-Harcourt,0.0,0,23.0
4,Jalingo,2294800.0,Taraba,16.0


Another tactic is to fill missing values with the mean of the numerical column value (or the mode for categorical features).

In [None]:
my_DF.fillna(my_DF.mean(numeric_only=True))

Unnamed: 0,Capital,Population,State,LGAs
0,Yola,3178950.0,Adamawa,22.0
1,,2598363.0,,19.5
2,,2321339.0,Yobe,17.0
3,Port-Harcourt,2598363.0,,23.0
4,Jalingo,2294800.0,Taraba,16.0
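
The mode-based option for categorical features can be sketched as follows. The `Region` column and its values are hypothetical, invented only for this illustration.

```python
import numpy as np
import pandas as pd

# 'Region' is a made-up categorical column for illustration
states = pd.DataFrame({'Region': ['North', 'South', np.nan, 'North'],
                       'LGAs': [22, np.nan, 17, 23]})

filled = states.copy()

# Numeric column: fill with the column mean
filled['LGAs'] = filled['LGAs'].fillna(filled['LGAs'].mean())

# Categorical column: fill with the most frequent value
# (mode() returns a Series, so we take its first entry)
filled['Region'] = filled['Region'].fillna(filled['Region'].mode()[0])
print(filled)
```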


## Data Aggregation (Grouping)

We will touch briefly on a common practice in data science: grouping a set of data attributes, either for retrieving some group statistics or for applying a particular set of functions to the group. Grouping is commonly used for data exploration and plotting graphs to understand more about the dataset. Rows whose grouping key is missing are excluded from a grouping operation by default.
Let’s see examples of how this works.

In [None]:
# create a data frame
my_DF = pd.DataFrame({
    'Sex': ['M', 'F', 'M', 'F','M', 'F','M', 'F'],
    'Age': np.random.randint(15,60,8),
    'Salary': np.random.rand(8)*10000})
my_DF 

Unnamed: 0,Sex,Age,Salary
0,M,54,6343.357261
1,F,40,3034.746908
2,M,46,2202.40119
3,F,44,4453.729352
4,M,50,5914.748574
5,F,23,3924.592456
6,M,18,3610.366344
7,F,32,7568.475998


Let’s find the mean age and salary for observations in our dataset grouped by `Sex`.

In [None]:
my_DF.groupby('Sex').mean()

Unnamed: 0_level_0,Age,Salary
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1
F,34.75,4745.386178
M,42.0,4517.718342


We can group by more than one variable. In this case for each `Sex` group, also group the `Age` and find the mean of the other numeric variables.

In [None]:
my_DF.groupby([my_DF['Sex'], my_DF['Age']]).mean() 

Unnamed: 0_level_0,Unnamed: 1_level_0,Salary
Sex,Age,Unnamed: 2_level_1
F,23,3924.592456
F,32,7568.475998
F,40,3034.746908
F,44,4453.729352
M,18,3610.366344
M,46,2202.40119
M,50,5914.748574
M,54,6343.357261


Also, we can use a variable as a group key to run a group function on another variable or sets of variables.

In [None]:
my_DF['Age'].groupby(my_DF['Salary']).mean() 

Salary
2202.401190    46
3034.746908    40
3610.366344    18
3924.592456    23
4453.729352    44
5914.748574    50
6343.357261    54
7568.475998    32
Name: Age, dtype: int64
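
To retrieve several group statistics at once, one option is the `agg` method with a list of function names. The sketch below uses fixed, made-up values in a frame named `staff` so that the result is reproducible.

```python
import pandas as pd

# Illustrative values only
staff = pd.DataFrame({'Sex': ['M', 'F', 'M', 'F'],
                      'Salary': [6000.0, 3000.0, 2000.0, 5000.0]})

# One row per group, one column per requested statistic
summary = staff.groupby('Sex')['Salary'].agg(['mean', 'max'])
print(summary)
```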

## Statistical Summaries

Descriptive statistics is an essential component in data science. By investigating the properties of the dataset, we can gain a better understanding of the data and the relationship between the variables. This information is useful in making decisions about the type of data transformations to carry out or the types of learning algorithms to spot check. Let’s see some examples of simple statistical functions in Pandas.
First, we’ll create a Pandas dataframe.

In [None]:
my_DF = pd.DataFrame(np.random.randint(10,80,[7,4]),\
            columns=['First','Second','Third', 'Fourth'])

We use the `describe` function to obtain summary statistics of a dataset. Eight statistical measures are displayed. They are count, mean, standard deviation, minimum value, 25th percentile, 50th percentile or median, 75th percentile, and the maximum value.

In [None]:
my_DF.describe() 

Unnamed: 0,First,Second,Third,Fourth
count,7.0,7.0,7.0,7.0
mean,35.714286,31.0,38.428571,49.285714
std,13.877011,19.485037,11.602545,12.539462
min,14.0,10.0,25.0,34.0
25%,28.0,14.0,28.5,40.5
50%,37.0,31.0,38.0,52.0
75%,43.0,43.0,47.0,53.5
max,57.0,62.0,55.0,71.0


### Correlation

Correlation measures the strength of the linear relationship between two numerical variables. Parametric machine learning methods such as logistic and linear regression can take a performance hit when variables are highly correlated. The correlation values range from -1 to 1, with 0 indicating no linear correlation. A value of -1 signifies a perfect negative correlation, while 1 signifies a perfect positive correlation. A common rule of thumb is to consider eliminating one variable from any pair whose correlation is below -0.7 or above 0.7. A common correlation estimate in use is Pearson’s correlation coefficient.

In [None]:
my_DF.corr(method='pearson')

Unnamed: 0,First,Second,Third,Fourth
First,1.0,0.514681,-0.468032,0.228503
Second,0.514681,1.0,-0.043496,-0.338338
Third,-0.468032,-0.043496,1.0,-0.822346
Fourth,0.228503,-0.338338,-0.822346,1.0
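
One way to apply a cutoff such as 0.7 is to scan each pair of columns once and keep those whose absolute correlation exceeds it. This is a sketch on small made-up data; the cutoff and the names `data` and `strong_pairs` are assumptions for illustration.

```python
from itertools import combinations

import pandas as pd

# Illustrative data: 'Second' rises almost in step with 'First'
data = pd.DataFrame({'First': [1, 2, 3, 4, 5],
                     'Second': [2, 4, 6, 8, 11],
                     'Third': [5, 1, 4, 2, 3]})

corr = data.corr(method='pearson')

# Visit each unordered pair of columns once and test |r| > 0.7
strong_pairs = [(a, b) for a, b in combinations(corr.columns, 2)
                if abs(corr.loc[a, b]) > 0.7]
print(strong_pairs)
```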


### Skewness

Another important statistical metric is the skewness of the dataset. Skewness measures how asymmetric a distribution is: compared with a symmetric bell-shaped (normal) distribution, a skewed distribution has a longer tail toward the right (positive skew) or the left (negative skew). Pandas offers a convenient method called `skew()` to check the skewness of each variable. Values close to 0 indicate a more symmetric distribution with less skew.

In [None]:
my_DF.skew() 

First    -0.064567
Second    0.436020
Third     0.242302
Fourth    0.522977
dtype: float64
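
The sign of the skewness can be checked on two tiny hand-made samples, one symmetric and one with a long right tail. The values are invented for illustration.

```python
import pandas as pd

# Symmetric around 3, so the skewness is zero
symmetric = pd.Series([1, 2, 3, 4, 5])
print(symmetric.skew())

# One large value stretches the right tail, giving positive skew
right_tailed = pd.Series([1, 1, 2, 2, 10])
print(right_tailed.skew())
```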

## Importing Data


Getting data into the programming environment for analysis is a fundamental first step for any data analytics or machine learning task. In practice, data usually comes in comma-separated values (CSV) format, which we can read with `read_csv()`. We can also export a DataFrame to CSV with `to_csv()`, as done first below so that there is a file to read back.

In [None]:
my_DF.to_csv('file_name.csv', index=False)

In [None]:
my_DF = pd.read_csv('file_name.csv', sep=',')
my_DF

Unnamed: 0,First,Second,Third,Fourth
0,53,63,62,64
1,49,60,46,60
2,45,62,16,46
3,46,17,48,45
4,46,77,39,43
5,44,58,48,68
6,48,65,15,27
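
The export and import steps can be tried without touching the file system by writing to an in-memory text buffer. This sketch assumes a small made-up frame named `original`.

```python
import io

import pandas as pd

original = pd.DataFrame({'First': [53, 49], 'Second': [63, 60]})

# Write CSV text into an in-memory buffer instead of a file on disk
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Passing `index=False` keeps the row labels out of the CSV, so reading it back does not produce an extra unnamed column.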
