<div style="color:#006666; padding:0px 10px; border-radius:5px; font-size:18px;"><h1 style='margin:10px 5px'>Inspecting DataFrames and Useful Tips</h1>
</div>

© Copyright Machine Learning Plus

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>1. Inspecting a DataFrame</h2>
</div>

Most of the time you will import a DataFrame from a pre-existing dataset rather than create one from scratch. In such cases you will want to know more about the structure and content of the DataFrame.

In [None]:
import numpy as np
import pandas as pd

Import data

In [None]:
df = pd.read_csv("Datasets/Churn.csv")
df

__The first thing you want to know is how many rows and columns are present__

In [None]:
df.shape

In [None]:
len(df)
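`df.shape` returns a `(rows, columns)` tuple, so it can be unpacked directly, while `len(df)` counts rows only. A minimal sketch on a small made-up DataFrame (the `toy` data is illustrative and not taken from the Churn dataset):

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the calls
toy = pd.DataFrame({
    "state": ["KS", "OH", "NJ"],
    "account length": [128, 107, 137],
})

n_rows, n_cols = toy.shape   # shape is a (rows, columns) tuple
print(n_rows, n_cols)        # 3 2
print(len(toy) == n_rows)    # True: len() counts rows only
```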

__See Top n and Bottom n rows.__

In [None]:
df.head(10)

In [None]:
df.tail(6)

__`df.info()` shows the datatypes, the number of non-null records and the memory usage__

In [None]:
df.info()

__Check the memory usage of each column__

In [None]:
df.memory_usage(deep=True)
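To see what `deep=True` changes, the sketch below compares both modes on a made-up DataFrame. The `state` column is forced to `object` dtype here as an assumption, so that the difference is visible regardless of the pandas version's default string type.

```python
import pandas as pd

# Hypothetical toy data; "state" is explicitly stored as Python objects
toy = pd.DataFrame({
    "state": pd.Series(["KS", "OH", "NJ"], dtype=object),
    "minutes": [265.1, 161.6, 243.4],
})

shallow = toy.memory_usage()         # object columns: only the references are counted
deep = toy.memory_usage(deep=True)   # object columns: the Python strings are measured too

print(shallow)
print(deep)
```

For numeric columns both calls report the same figure; only object columns grow under `deep=True`.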

__Check only the datatypes__

In [None]:
df.dtypes

__Convert the Boolean 'churn' column to integer datatype.__

In [None]:
df['churn'] = df['churn'].astype('int')

In [None]:
# Check Again
df.info()

In [None]:
# Check values
df.head()
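As a quick illustration of what the conversion does to the values, here is a small sketch on made-up data (the `toy` frame is hypothetical): `False` becomes `0` and `True` becomes `1`, which makes the column easy to sum or average.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'churn' column
toy = pd.DataFrame({"churn": [False, True, False]})

toy["churn"] = toy["churn"].astype("int")
print(toy["churn"].tolist())   # [0, 1, 0]
print(toy["churn"].sum())      # 1, the number of churned records
```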

Another aspect of examining a dataframe is to study the summary statistics. We will come to that after a short detour to understand how to rename columns.

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>2. Approaches to Renaming Columns</h2>
</div>

In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv("Datasets/Churn.csv")
df

__Get column names__

In [None]:
df.columns

__1. Rename specific columns__

In [None]:
df.rename(columns={'account length': 'account_length'})

On checking, the column names are unchanged, because `rename` returns a new DataFrame and leaves `df` as it was. To change `df` itself, add `inplace=True`.

In [None]:
df.columns

In [None]:
df.rename(columns={'account length': 'account_length'}, inplace=True)
df.columns
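An alternative to `inplace=True` is to assign the result of `rename` back to the variable. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with a column name containing a space
toy = pd.DataFrame({"account length": [128, 107], "churn": [False, True]})

renamed = toy.rename(columns={"account length": "account_length"})
print(list(toy.columns))       # ['account length', 'churn'], original untouched
print(list(renamed.columns))   # ['account_length', 'churn']

# Reassigning keeps the change without using inplace=True
toy = toy.rename(columns={"account length": "account_length"})
```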

__2. Rename all columns in one shot: Change case__

In [None]:
df.rename(str.upper, axis='columns').head()

__3. Rename all columns in one shot: Replace all space character with "_"__

Use a lambda function.

In [None]:
df.rename(lambda x: x.replace(" ", "_"), axis='columns').head()
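The same replacement can also be written with the vectorized string methods on the column index. This is a sketch of one alternative on made-up data, not the only way to do it; unlike `rename`, assigning to `.columns` modifies the DataFrame directly.

```python
import pandas as pd

# Hypothetical toy data with spaces in the column names
toy = pd.DataFrame({"account length": [128], "total day minutes": [265.1]})

# .str gives access to string methods applied to every column name
toy.columns = toy.columns.str.replace(" ", "_")
print(list(toy.columns))   # ['account_length', 'total_day_minutes']
```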

### Challenge

1. Change all the column names to title case. That is, "account length" becomes "Account Length".

```
df = pd.read_csv("Datasets/Churn.csv")
```

In [None]:
# Solution
df.rename(lambda x: x.title(), axis='columns').head()

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>3. Summary Statistics</h2>
</div>

In [None]:
import pandas as pd
df = pd.read_csv("Datasets/Churn.csv")

In [None]:
df.describe()
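By default `describe()` summarizes only the numeric columns. Passing `include="all"` adds the non-numeric ones as well. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with one text column and one numeric column
toy = pd.DataFrame({
    "state": ["KS", "OH", "KS"],
    "minutes": [10.0, 20.0, 30.0],
})

numeric_only = toy.describe()              # count, mean, std, min, quartiles, max
everything = toy.describe(include="all")   # also unique, top, freq for text columns

print(numeric_only)
print(everything)
```

Statistics that do not apply to a column (for example `mean` for `state`) show up as `NaN` in the combined table.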

<div class="alert alert-info" style="padding:0px 10px; border-radius:5px;"><p style='margin:10px 5px'><strong>Pro Tip:</strong> `pandas-summary` is a Python package that provides more elaborate summary statistics.</p>
</div>

In [None]:
#!pip install pandas-summary

In [None]:
from pandas_summary import DataFrameSummary
dfs = DataFrameSummary(df)

In [None]:
dfs.columns_stats

__Categorical / String Column__

In [None]:
dfs['state']

__Numeric Column Summary__

In [None]:
dfs['account length']

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>4. Essential Operations</h2>
</div>

When working with data, a handful of operations come up again and again.

Let's get familiar with them.

In [None]:
import pandas as pd
df = pd.read_csv("Datasets/Churn.csv")
df.head()

__View Unique items__

In [None]:
df['state'].unique()

__Number of unique items__

In [None]:
df['state'].nunique()

__Number of occurrences of each item__

In [None]:
df['state'].value_counts()

In [None]:
df['state'].value_counts(normalize=True)

In [None]:
df['state'].value_counts(normalize=True).sum()
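To make the relationship between counts and proportions concrete, here is a small sketch on a made-up Series (the values are hypothetical): `normalize=True` divides each count by the total, so the shares add up to 1.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'state' column
toy = pd.Series(["KS", "OH", "KS", "NJ"], name="state")

counts = toy.value_counts()                 # absolute frequencies
shares = toy.value_counts(normalize=True)   # relative frequencies

print(counts["KS"])    # 2
print(shares["KS"])    # 0.5
print(shares.sum())    # 1.0
```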

__Get the top 'n' rows ordered by a given column__

In [None]:
df.nlargest(5, 'account length')
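`nlargest` gives the same rows as sorting in descending order and taking the head, and `nsmallest` is its counterpart for the lowest values. A minimal sketch on made-up data (the `toy` frame is hypothetical and has no tied values):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "state": ["KS", "OH", "NJ", "WV"],
    "account length": [128, 107, 137, 84],
})

top2 = toy.nlargest(2, "account length")
same = toy.sort_values("account length", ascending=False).head(2)
bottom2 = toy.nsmallest(2, "account length")

print(top2["state"].tolist())      # ['NJ', 'KS']
print(top2.equals(same))           # True
print(bottom2["state"].tolist())   # ['WV', 'OH']
```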

__Drop a column__

`drop` returns a new DataFrame and leaves `df` unchanged. Use `inplace=True` (or reassign the result) to modify the DataFrame itself.

In [None]:
df.drop(columns='churn')

In [None]:
df.head()

__Drop records by index__

In [None]:
df.drop(index=[0, 1, 2])
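Both forms of `drop` return a new DataFrame, so the result has to be kept in a variable for the change to persist. The sketch below uses made-up data; the `reset_index(drop=True)` step is an optional extra, added here on the assumption that a gap-free index is wanted after removing rows.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "state": ["KS", "OH", "NJ"],
    "churn": [False, True, False],
})

without_churn = toy.drop(columns="churn")
print("churn" in toy.columns)   # True: the original still has the column

# Drop the first row by index label, then renumber the remaining rows
toy = toy.drop(index=[0]).reset_index(drop=True)
print(toy.index.tolist())       # [0, 1]
```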

__Transpose a Dataframe__

In [None]:
df.T

In [None]:
df.T.info()

Transposing puts elements of different datatypes into the same column, so all columns are converted to the 'object' datatype.
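The effect can be checked on a small made-up DataFrame (the `toy` frame is hypothetical). When every column shares one datatype, as in the numeric-only selection below, the transpose keeps that datatype.

```python
import pandas as pd

# Hypothetical toy data mixing text and floats
toy = pd.DataFrame({"state": ["KS", "OH"], "minutes": [265.1, 161.6]})

print(toy.T.dtypes)       # every transposed column holds a string and a float

numeric = toy[["minutes"]]
print(numeric.T.dtypes)   # float64 is preserved when all values are floats
```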

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>5. Display Options</h2>
</div>

Pandas provides a set of options that control how DataFrames are displayed.

Some very useful options:

1. _Rows to display_: `pd.options.display.max_rows = 100`

   Another way: `pd.set_option("display.max_rows", 5)`
   
2. _Columns to display_: `pd.options.display.max_columns`
3. _Column width_: `pd.options.display.max_colwidth`
4. _Precision_: `pd.options.display.precision = 15`
5. _Format of floats_: `pd.options.display.float_format = '{:.2f}%'.format`

In [None]:
# Reset
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
pd.reset_option('display.precision')
pd.reset_option('display.float_format')
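When a setting is needed only for a single display, `pd.option_context` applies it temporarily and restores the previous value on exit, so no reset is required. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with many decimals
toy = pd.DataFrame({"x": [0.123456789]})

before = pd.get_option("display.precision")

# The option is changed only inside the with-block
with pd.option_context("display.precision", 2):
    inside = repr(toy)

print(inside)                                         # x shown as 0.12
print(pd.get_option("display.precision") == before)   # True: setting restored
```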

__By default, Pandas truncates rows and columns__.

In [None]:
import numpy as np
import pandas as pd

In [None]:
arr = np.random.randint(1,100, (80, 25))
df = pd.DataFrame(arr)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
0,9,55,9,65,36,85,86,49,44,77,...,57,38,91,77,74,36,54,97,39,60
1,93,64,28,6,60,11,9,70,17,68,...,9,41,83,61,52,37,71,11,52,40
2,91,22,40,98,35,21,77,1,8,44,...,11,76,93,29,72,85,69,42,60,42
3,89,85,4,39,68,32,86,58,4,99,...,35,96,89,95,12,26,78,85,49,77
4,15,39,78,71,84,53,13,48,16,72,...,63,19,26,41,22,39,78,22,61,26
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75,68,78,92,45,60,10,49,94,74,75,...,83,6,12,87,31,96,51,79,98,24
76,68,28,18,15,6,19,16,14,24,81,...,29,29,84,24,81,54,92,98,42,25
77,97,94,55,5,93,72,3,90,13,97,...,30,12,16,4,67,96,73,74,37,73
78,76,43,59,64,9,65,84,57,3,12,...,45,31,55,29,21,25,66,22,38,81


__Print the current value of settings__

In [None]:
# options
print(pd.options.display.max_rows)
print(pd.options.display.max_columns)

60
20


Let's change the max_rows setting.

__CAVEAT__: Raising `max_rows` shows the full DataFrame only when the setting is at least the number of rows in the DataFrame. Otherwise pandas keeps showing a truncated view.

In [None]:
# Does not show 30 rows: the DataFrame has 80 rows, which still exceeds max_rows, so the view stays truncated
pd.set_option("display.max_rows", 30)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
0,9,55,9,65,36,85,86,49,44,77,...,57,38,91,77,74,36,54,97,39,60
1,93,64,28,6,60,11,9,70,17,68,...,9,41,83,61,52,37,71,11,52,40
2,91,22,40,98,35,21,77,1,8,44,...,11,76,93,29,72,85,69,42,60,42
3,89,85,4,39,68,32,86,58,4,99,...,35,96,89,95,12,26,78,85,49,77
4,15,39,78,71,84,53,13,48,16,72,...,63,19,26,41,22,39,78,22,61,26
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75,68,78,92,45,60,10,49,94,74,75,...,83,6,12,87,31,96,51,79,98,24
76,68,28,18,15,6,19,16,14,24,81,...,29,29,84,24,81,54,92,98,42,25
77,97,94,55,5,93,72,3,90,13,97,...,30,12,16,4,67,96,73,74,37,73
78,76,43,59,64,9,65,84,57,3,12,...,45,31,55,29,21,25,66,22,38,81
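When a DataFrame has more rows than `display.max_rows`, the length of the truncated view is governed by a separate option, `display.min_rows`. The sketch below assumes pandas 0.25 or newer, where this option is available, and uses a made-up 80-row DataFrame to compare two values of it.

```python
import numpy as np
import pandas as pd

# Hypothetical 80-row frame, longer than max_rows in both cases below
toy = pd.DataFrame(np.arange(240).reshape(80, 3))

with pd.option_context("display.max_rows", 30, "display.min_rows", 10):
    short_view = repr(toy)   # truncated view with about 10 data rows

with pd.option_context("display.max_rows", 30, "display.min_rows", 20):
    long_view = repr(toy)    # truncated view with about 20 data rows

print(len(short_view.splitlines()), len(long_view.splitlines()))
```

So to see more rows of a long DataFrame without printing all of it, raising `display.min_rows` is one option to try.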


In [None]:
# Set more than the number of rows. Works!
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 30)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24
0,9,55,9,65,36,85,86,49,44,77,99,45,60,77,32,57,38,91,77,74,36,54,97,39,60
1,93,64,28,6,60,11,9,70,17,68,29,85,30,11,60,9,41,83,61,52,37,71,11,52,40
2,91,22,40,98,35,21,77,1,8,44,37,25,29,16,76,11,76,93,29,72,85,69,42,60,42
3,89,85,4,39,68,32,86,58,4,99,93,38,86,59,66,35,96,89,95,12,26,78,85,49,77
4,15,39,78,71,84,53,13,48,16,72,87,94,39,8,94,63,19,26,41,22,39,78,22,61,26
5,5,69,92,42,65,32,36,79,64,90,28,71,19,91,80,78,91,61,1,70,28,43,96,48,97
6,11,26,61,68,29,7,23,66,84,96,2,17,64,95,91,10,72,63,94,81,66,49,39,14,16
7,56,29,25,66,21,9,30,28,14,83,46,68,57,61,37,94,29,47,41,23,68,54,65,66,45
8,52,40,30,9,85,44,52,35,25,87,11,23,58,21,64,82,11,80,32,72,6,61,92,94,63
9,80,9,12,58,94,15,19,82,4,6,87,63,73,63,51,6,79,43,90,37,34,91,25,38,90


__Control precision: the number of decimal places displayed__

In [None]:
arr = np.random.rand(10)
df = pd.DataFrame(arr)
df.head()

Unnamed: 0,0
0,0.936754
1,0.76162
2,0.153723
3,0.497323
4,0.157203


In [None]:
pd.set_option('display.precision', 3)
df.head()

Unnamed: 0,0
0,0.937
1,0.762
2,0.154
3,0.497
4,0.157


In [None]:
pd.set_option('display.precision', 5)
df.head()

Unnamed: 0,0
0,0.93675
1,0.76162
2,0.15372
3,0.49732
4,0.1572


__Float Format__

In [None]:
pd.set_option('display.float_format', '{:.2f}%'.format)

In [None]:
df.head()

Unnamed: 0,0
0,0.94%
1,0.76%
2,0.15%
3,0.50%
4,0.16%
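Note that `'{:.2f}%'` only appends a percent sign to the number, so 0.94 is shown as `0.94%`. If the values are fractions that should read as percentages, Python's `%` format type multiplies by 100 first. A sketch on made-up data, applied temporarily through `pd.option_context`:

```python
import pandas as pd

# Hypothetical toy data holding fractions between 0 and 1
toy = pd.DataFrame({"share": [0.5, 0.125]})

# '{:.2%}' multiplies by 100 and appends '%'
with pd.option_context("display.float_format", "{:.2%}".format):
    as_percent = repr(toy)

print(as_percent)   # shows 50.00% and 12.50%
```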


Reference: https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html