![](http://i.imgur.com/c7Kdkuu.png)


- <a href='#1'>1. Introduction</a>  
- <a href='#2'>2. Data Cleaning</a>
- <a href='#3'>3. Exploratory Data Analysis</a>
- <a href='#4'>4. Data Visualization</a>





# <a id='1'>1. Introduction</a>


The name Pandas is derived from the term “panel data” and is also a play on “Python data analysis”. It is a free, open-source library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

A great thing about Pandas is that it takes data (from a CSV file or a SQL database, for example) and creates a Python object with rows and columns, called a DataFrame, that looks very similar to a table in spreadsheet or statistical software (Excel, for example).

The objective of this Kernel is to teach and refresh Pandas skills.

## Background

Pandas is the most popular Python library for doing data analysis. While it does offer quite a lot of functionality, it is also regarded as a fairly difficult library to learn well. Some reasons for this include:

1. There are often multiple ways to complete common tasks
2. There are over 240 DataFrame attributes and methods
3. There are several methods that are aliases (reference the same exact underlying code) of each other
4. There are several methods that have nearly identical functionality
5. There are many tutorials written by different people that show different ways to do the same thing
6. There is no official document with guidelines on how to idiomatically complete common tasks
7. The official documentation itself contains non-idiomatic code
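
As a small illustration of points 3 and 4, the sketch below checks that two pairs of method names return identical results. The toy frame and its values are made up for this example and are not part of the insurance dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each column
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

# isnull() and isna() flag exactly the same cells
same_null = toy.isnull().equals(toy.isna())

# notnull() and notna() are interchangeable in the same way
same_notnull = toy.notnull().equals(toy.notna())

print(same_null, same_notnull)
```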

In [None]:
#Load libraries and the dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('../input/insurance.csv')
df.head()

In [None]:
df.dtypes #check datatype

**Columns in the dataset:**

* age: age of primary beneficiary
* sex: gender
* bmi: body mass index, an objective measure of whether body weight is high or low relative to height, computed as weight divided by height squared (kg / m ^ 2); ideally 18.5 to 24.9
* children: Number of children covered by health insurance / Number of dependents
* smoker: whether the beneficiary smokes (yes / no)
* region: the beneficiary's residential area in the US: northeast, southeast, southwest, or northwest
* charges: Individual medical costs billed by health insurance


# <a id='2'>2. Data Cleaning</a>



Data cleaning is one of the most critical steps in data analysis. It is the process of detecting and correcting (or removing) corrupt or inaccurate records in a record set, table, or database. In practice, this means identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.

Here are several key benefits that come out of the data cleaning process:

* It removes major errors and inconsistencies that are inevitable when multiple sources of data are getting pulled into one dataset.
* Using tools to clean up data makes everyone more efficient, since they can quickly get what they need from the data.
* Fewer errors mean happier customers and fewer frustrated employees.
* It lets you map the different functions of your data: what it is intended to do and where it comes from.

**Inspecting the data**

To start data cleaning, take a look at the data. If it is a large dataset, look at the top 20 rows (df.head(20)), the bottom 20 rows (df.tail(20)), and a random sample of 20 rows (df.sample(20)). You can also summarize the data using the pandas methods below. In recent pandas versions, the purely numeric statistics need numeric_only=True when the data frame also contains text columns:

* df.mean(numeric_only=True) - returns the mean of each numeric column
* df.corr(numeric_only=True) - returns the pairwise correlation between numeric columns in a data frame
* df.count() - returns the number of non-null values in each data frame column
* df.max() - returns the highest value in each column
* df.min() - returns the lowest value in each column
* df.median(numeric_only=True) - returns the median of each numeric column
* df.std(numeric_only=True) - returns the standard deviation of each numeric column
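
The inspection steps above can be sketched on a small stand-in frame. The frame `sample_df` and its values are hypothetical, chosen only so the example runs without the insurance CSV; `numeric_only=True` is passed because the frame mixes numeric and text columns.

```python
import pandas as pd

# Hypothetical stand-in for the insurance data (values are made up)
sample_df = pd.DataFrame({
    "age": [19, 18, 28, 33, 32, 31],
    "sex": ["female", "male", "male", "male", "male", "female"],
    "bmi": [27.9, 33.8, 33.0, 22.7, 28.9, 25.7],
    "smoker": ["yes", "no", "no", "no", "no", "no"],
})

top = sample_df.head(3)                             # first 3 rows
bottom = sample_df.tail(3)                          # last 3 rows
random_rows = sample_df.sample(3, random_state=0)   # 3 random rows, reproducible

# Summary statistics; text columns are skipped
means = sample_df.mean(numeric_only=True)
summary = sample_df.describe()

print(means)
print(summary)
```

On a real dataset you would use 20 instead of 3, as suggested above.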


**Missing values:** We always check for missing values in the data by running df.isnull(), which returns a boolean data frame (True for missing values and False for non-missing values). To count the missing values in each column, run df.isnull().sum(). df.notnull() is the opposite of df.isnull(). Once you know where the missing values are, you can get rid of them by using df.dropna() to drop the affected rows or df.dropna(axis=1) to drop the affected columns. A different approach is to fill the missing values: df.fillna(x) fills them with x (any value you choose), and for a Series s, s.fillna(s.mean()) replaces all null values with the mean (the mean can be replaced with almost any function from the statistics section).
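
A minimal sketch of these calls follows. The insurance data has no missing values, so it uses a hypothetical toy frame with gaps added on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap in "age" and one in "bmi"
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "bmi": [22.5, 30.1, np.nan, 27.4],
    "region": ["southwest", "northeast", "southeast", "northwest"],
})

null_counts = toy.isnull().sum()     # number of missing values per column
rows_dropped = toy.dropna()          # keep only rows without any missing value
cols_dropped = toy.dropna(axis=1)    # keep only columns without any missing value

# Fill the gaps in one column with that column's mean
age_filled = toy["age"].fillna(toy["age"].mean())

print(null_counts)
print(age_filled)
```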

**Replace values:** It is sometimes necessary to *replace values* with different values. For example, s.replace(1,'one') replaces all values equal to 1 with 'one'. You can also do this for multiple values at once: s.replace([1,3],['one','three']) replaces every 1 with 'one' and every 3 with 'three'. Beyond values, you can rename specific columns by running df.rename(columns={'old_name': 'new_name'}), or use df.set_index('column_one') to change the index of the data frame. All of these return a new object and leave the original unchanged.
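
The sketch below runs these calls on hypothetical toy objects. The names `s` and `toy` and the column names are placeholders that mirror the text, not columns of the insurance data.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 1])

single = s.replace(1, "one")                       # replace one value
multiple = s.replace([1, 3], ["one", "three"])     # replace several values at once

# Hypothetical frame using the placeholder column names from the text
toy = pd.DataFrame({"old_name": [1, 2], "column_one": ["a", "b"]})

renamed = toy.rename(columns={"old_name": "new_name"})   # returns a new frame
indexed = renamed.set_index("column_one")                # "column_one" becomes the index

print(multiple.tolist())
print(indexed)
```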

Now that we know some of the basic things Pandas can do, let's apply this knowledge to our dataset df.

In [None]:
df.mean(numeric_only=True) #text columns such as sex and region have no mean

In [None]:
df.corr(numeric_only=True) #correlation is only defined for numeric columns

In [None]:
df.count()

In [None]:
df.max()

In [None]:
df.std(numeric_only=True) #skip the text columns

In [None]:
df.isnull().sum(axis=0) #check null

We have no null values in our dataset.

# <a id='3'>3. Exploratory Data Analysis</a>

In this section, we will do a bit of exploratory data analysis (EDA) using PANDAS. 

### Crosstabs

In [None]:
#Crosstab - we can check some basic hypotheses using the pandas crosstab function
pd.crosstab(df["sex"],df["region"],margins=True)

As seen above, we can use crosstab to check some basic hypotheses. In this case, we wanted to see the distribution of males and females across the different regions. These are absolute counts, but percentages are often more intuitive for quick insights. We can convert each row to a share of its row total (the 'All' margin) using the apply function, as shown below:

In [None]:
def percConvert(ser):
    # Divide every count by the last entry of the row, which is the 'All' total
    return ser/float(ser.iloc[-1])
pd.crosstab(df["sex"],df["region"],margins=True).apply(percConvert, axis=1)
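
As an alternative to a custom function, `pd.crosstab` also accepts a `normalize` argument. The sketch below uses a hypothetical toy frame rather than the insurance data and shows row-wise shares with `normalize="index"`.

```python
import pandas as pd

# Hypothetical toy data: three females and three males
toy = pd.DataFrame({
    "sex": ["male", "male", "female", "female", "male", "female"],
    "region": ["southwest", "northeast", "southwest",
               "southwest", "northeast", "northeast"],
})

# Each row is divided by its own total, so every row sums to 1
row_share = pd.crosstab(toy["sex"], toy["region"], normalize="index")

print(row_share)
```

This variant does not add the 'All' margin, so it is a different presentation of the same proportions rather than a drop-in replacement.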

### Boolean Indexing

What do you do if you want to filter rows based on conditions on several columns? For instance, suppose we want a list of all males who smoke and live in the 'southwest' region.

Boolean indexing can help here. You can use the following code:

In [None]:
df.loc[(df["sex"]=="male") & (df["smoker"]=="yes") & (df["region"]=="southwest"), ["sex","smoker","region"]].head(10)
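
To make the mechanics easier to follow, the sketch below builds the same kind of filter step by step on a hypothetical toy frame. Each comparison produces a boolean Series, and `&` combines them element-wise.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    "sex": ["male", "female", "male", "male"],
    "smoker": ["yes", "yes", "no", "yes"],
    "region": ["southwest", "southwest", "southwest", "northeast"],
    "charges": [100.0, 200.0, 300.0, 400.0],
})

# One boolean Series per condition
is_male = toy["sex"] == "male"
is_smoker = toy["smoker"] == "yes"
in_southwest = toy["region"] == "southwest"

# A row is kept only if all three conditions are True
mask = is_male & is_smoker & in_southwest
selected = toy.loc[mask, ["sex", "smoker", "region"]]

print(mask.tolist())
print(selected)
```

The parentheses in the one-line version are needed because `&` binds more tightly than `==` in Python.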

###  Sorting dataframes

Pandas allows easy sorting based on multiple columns. The data is sorted by the first column, and ties are broken by the next one. This can be done as follows:

In [None]:
df_sorted = df.sort_values(['smoker','region'], ascending=False)
df_sorted[['smoker','region']].head(10)
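
The `ascending` argument also accepts one flag per column. A small sketch on a hypothetical toy frame, sorting `smoker` in descending order and `region` in ascending order:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "smoker": ["no", "yes", "yes", "no"],
    "region": ["southwest", "northeast", "southwest", "northeast"],
})

# smoker: Z to A ("yes" first); region: A to Z within each smoker group
mixed = toy.sort_values(["smoker", "region"], ascending=[False, True])

print(mixed)
```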

### Pivot Table

Pandas can be used to create MS Excel style pivot tables. In this dataset, the key column is “charges”. As an example, we can pivot it to get the mean charges for each combination of 'sex' and 'smoker'.

In [None]:
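# Pivot: mean charges for each (sex, smoker) group
# The string "mean" is used because newer pandas versions warn when given np.mean
impute_grps = df.pivot_table(values=["charges"], index=["sex","smoker"], aggfunc="mean")
print(impute_grps)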
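
For a single aggregation like this, a pivot table gives the same numbers as a groupby. The sketch below compares the two on a hypothetical toy frame with made-up values.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "sex": ["male", "male", "female", "female", "male"],
    "smoker": ["yes", "no", "yes", "yes", "yes"],
    "charges": [100.0, 20.0, 80.0, 120.0, 200.0],
})

# Pivot table: returns a DataFrame with one "charges" column
pivoted = toy.pivot_table(values="charges", index=["sex", "smoker"], aggfunc="mean")

# Equivalent groupby: returns a Series with the same (sex, smoker) index
grouped = toy.groupby(["sex", "smoker"])["charges"].mean()

print(pivoted)
print(grouped)
```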

# <a id='4'>4. Data Visualization</a>

Data visualization is a general term that describes any effort to help people understand the significance of data by placing it in a visual context. Patterns, trends, and correlations that might go undetected in text-based data can be exposed and recognized more easily with data visualization software.

We can do data visualization directly with Pandas. However, please note that dedicated libraries such as seaborn produce more appealing plots.



In [None]:
df.boxplot(column="charges",by="region", figsize=(18, 8))

In [None]:
df.boxplot(column="charges",by="smoker", figsize=(18, 8))

In [None]:
df.boxplot(column="charges",by="children", figsize=(18, 8))

In [None]:
df.hist(column="charges",by="sex",figsize=(18, 8), bins=30)

In [None]:
#Scatter 
df.plot.scatter(x='charges', y='bmi', figsize=(18, 8))

In [None]:
# Area Plot
df_new = df.drop(columns = 'charges') #dropping charges for the plot
df_new.plot.area(figsize=(18, 8))

In [None]:
# Kernel Density Estimation plot (KDE)
df['charges'].plot(kind='kde')

Thanks for reading the Kernel. I will **continue updating** this. Please **leave a comment for any suggestions**. 

If you have time, you can learn more in the YouTube video below :)

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo("B42n3Pc-N2A")