# An Introduction to the Initial Stage of the Data Science Process Using Pandas

Hi,

I am an advocate of learning through teaching, so in writing this tutorial, hopefully you and I can both learn about Pandas. Moreover, a subject that is fun is much more engaging and you learn so much more, so why not use some Pokémon! During the 1990s, Pokémon became a worldwide cultural phenomenon. Their popularity continues to this day across card games, movies, anime TV series, video games and mobile apps, including Pokémon Go, downloaded an incredible 750 million times in its first complete year.

This Notebook was designed to give myself a brief introduction to creating Kaggle Notebooks while also learning some more Pandas in Python. Here, I will utilise some of Pandas' more common and useful functions while traversing the *initial* stages of a typical data science process. In this tutorial, we will concentrate on *data preparation*, *data cleansing* and some *exploratory data analysis*. In future, we might also want to look at the remaining stages of *data analysis* and *communicating* our findings through *visualisation*.

Much of this tutorial is based on the excellent [10 Minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/10min.html) tutorial, and I was inspired by I, Coder's fantastic article [here](https://www.kaggle.com/ash316/learn-pandas-with-pokemons/notebook).

***By Grant Patience, 09/April/2018***
   

# Table of Contents
1. [What is Pandas?](#whatispandas)
2. [Prepare the Data](#preparethedata)
3. [Data Exploration - Viewing and Inspecting the Data](#dataexploration)
    1. [Selecting Data](#selectingdata)
    2. [Filtering and Min/Max values](#filteringminmax)
    3. [Statistics](#statistics)
    4. [Super Useful Statistics Function](#describe)
4. [Cleanse the Data](#cleansethedata)
5. [Conclusion](#conclusion)
6. [References](#references)

# **1. What is Pandas?** <a class="anchor" id="whatispandas"></a>

Pandas is a Python library for data analysis. Its name derives from "panel data" and is also a play on "Python data analysis". It is one of the most popular tools for data wrangling, and it makes analysing data in Python super easy.

It provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. 

The object that makes Pandas so useful is the dataframe. It is a structure with rows and columns, much like a relational database table or an Excel sheet, and it can hold the contents of CSV or TSV files, or even a SQL result set.

In [3]:
import pandas as pd #importing pandas for dataframe manipulation, data processing, CSV file I/O (e.g. pd.read_csv)
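To make the idea of a dataframe a little more concrete, here is a minimal sketch that builds one by hand. The three rows are a small hand-typed sample for illustration only (they are not read from the Kaggle file), and the variable name ```sample``` is my own choice.

```python
import pandas as pd

# A tiny hand-typed sample; the values are for illustration only
sample = pd.DataFrame(
    {
        "Name": ["Bulbasaur", "Charmander", "Squirtle"],
        "Type 1": ["Grass", "Fire", "Water"],
        "HP": [45, 39, 44],
    }
)

print(sample)         # rows and columns, like a small table
print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # each column carries its own data type
```

Each column can hold a different type of data, which is what makes the structure feel so much like a database table.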

# **2. Prepare the Data**<a class="anchor" id="preparethedata"></a>

At this stage, you will likely have spoken with your business stakeholders and had discussions around the issue they'd like you to resolve. Perhaps they'd like to know the distribution of Pokémon types, or the mean Attack and Defence statistics of Legendaries. You might be asked to predict or estimate a new Pokémon, or maybe you'd just like to know which Pokémon to concentrate your resources on catching (it's Mewtwo!).

Regardless, you start by loading your data and studying it. This informs your future actions and helps you understand what you will need to do to prepare and clean the data.

In reality, we will always have some cleansing to do; our data is rarely perfectly formed, complete or clean. So before we begin, we want to see how our dataframe looks, check it for completeness, resolve anomalies and perform any QA (Quality Assurance). We will perform some basic operations to familiarise ourselves with the dataset and discover what we're dealing with.

In [57]:
#read file and save to a pandas dataframe, a two-dimensional, size-mutable, potentially heterogeneous data structure
df = pd.read_csv('../input/Pokemon.csv')
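If you don't have the Kaggle file to hand, you can still try ```read_csv``` on a few lines of text. The sketch below assumes a hypothetical three-row extract with only a subset of the columns; the text in ```csv_text``` is typed in by hand purely for illustration.

```python
import io

import pandas as pd

# Hypothetical three-row extract with a subset of the columns (illustration only)
csv_text = """#,Name,Type 1,Type 2,HP,Legendary
1,Bulbasaur,Grass,Poison,45,False
4,Charmander,Fire,,39,False
150,Mewtwo,Psychic,,106,True
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
print(sample["Type 2"].isna().sum())  # empty fields are read in as NaN
```

Notice that the empty ```Type 2``` fields arrive as NaN, which is something we will come back to when cleansing.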

In [58]:
#Show the columns contained within the dataframe. 
df.columns

In [6]:
#Show the shape of the dataframe, number of rows by number of columns
df.shape

In [59]:
#Show the data type for each column in the dataframe
df.dtypes

In [60]:
#Show the indexes of the dataframe, you can use the index to filter or select data 
df.index

In [61]:
#Perhaps we want to change the index?
#set the index to the named attribute in the dataframe
df = df.set_index('Name')

#Show the index, which will now be Name
df.index
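One thing I found worth checking is that ```set_index``` hands back a new dataframe rather than changing the original, and that ```reset_index``` undoes it. A minimal sketch on a hand-typed, two-row sample (illustrative values, my own variable names):

```python
import pandas as pd

# Hand-typed sample for illustration only
sample = pd.DataFrame({"Name": ["Bulbasaur", "Charmander"], "HP": [45, 39]})

indexed = sample.set_index("Name")  # returns a new dataframe
print(indexed.index)                # the labels are now the names
print(sample.index)                 # the original still has its default RangeIndex

restored = indexed.reset_index()    # moves Name back into an ordinary column
print(restored.columns.tolist())
```

That is why the cell above assigns the result back to ```df```.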

In [62]:
#sort values by the index, and append .head() to return only the first five rows
df.sort_index().head()

In [63]:
#Show the index, column data types, non-null counts and memory usage
df.info()

# **3. Data Exploration - Viewing and Inspecting the Data**<a class="anchor" id="dataexploration"></a>

Nice work, we've read in our data, learned its general shape, size and structure, and set our index to the more intuitive ```Name``` attribute.

That might come in handy in future, but it doesn't tell us much about what our data actually is. How about we start to dig into the data itself? Let's use some of the Pandas functions that let you view the data.

In [64]:
#df.head(n) returns a DataFrame holding the first n rows of df; df.tail(n) returns the last n rows. Both are useful to briefly browse our data

df.head(n=5)

In [65]:
#Using df.sort_values() to sort our data by a specific attribute. Sort values by Attack attribute, default is ascending
#Appending .head(n=5) filters the result set to only the first 5 results
df.sort_values(by='Attack').head(n=5)

In [66]:
#Sort values by the Type 2 attribute, using some of the function's parameters. This time we want descending order with the NaN values last, so .tail() shows rows with no second type
df.sort_values(by='Type 2', ascending=False, na_position='last').tail(n=5)
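To see what ```na_position``` actually does, here is a small sketch on four hand-typed rows (illustration only), two of which have no second type.

```python
import pandas as pd

# Hand-typed sample for illustration; None stands in for a missing Type 2
sample = pd.DataFrame(
    {"Type 2": ["Poison", None, "Flying", None]},
    index=["Bulbasaur", "Charmander", "Charizard", "Squirtle"],
)

nan_last = sample.sort_values(by="Type 2", ascending=False, na_position="last")
nan_first = sample.sort_values(by="Type 2", ascending=False, na_position="first")

print(nan_last)   # Poison, Flying, then the missing values
print(nan_first)  # the missing values, then Poison, Flying
```

The missing values are kept together at one end whichever sort direction you choose.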

In [67]:
#Select a single column to return a Series
df['Type 1'].head(n=5)

# **3.1 Selecting Data**<a class="anchor" id="selectingdata"></a>

Pandas makes selecting data much easier than picking values out of a list or a dictionary. You can select a single column as a series (```df[col]```) or a few columns as a dataframe (```df[[col1, col2]]```). You can select data by its position (```df.iloc[0]```), or by its index label (```df.loc['index_one']```). To select the first row of your dataset, you can use ```df.iloc[0,:]```, and to select the value in the first row and first column you can use ```df.iloc[0,0]```.
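Here is a minimal sketch of those selection styles side by side, so you can see what type each one hands back. The three rows are hand-typed for illustration and the variable names are my own.

```python
import pandas as pd

# Hand-typed sample for illustration only
sample = pd.DataFrame(
    {"Type 1": ["Grass", "Fire", "Water"], "HP": [45, 39, 44]},
    index=["Bulbasaur", "Charmander", "Squirtle"],
)

one_col = sample["HP"]               # single brackets give a Series
two_cols = sample[["Type 1", "HP"]]  # a list of names gives a DataFrame
first_row = sample.iloc[0]           # by position: the first row as a Series
by_label = sample.loc["Squirtle"]    # by index label
cell = sample.iloc[0, 0]             # first row, first column: a single value

print(type(one_col).__name__, type(two_cols).__name__)
print(cell)
```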


In [68]:
#Slice rows by position, starting at position 10 and stopping before position 15
df[10:15]

In [69]:
#Select data based on the index - retrieves all data for index label Bulbasaur
df.loc['Bulbasaur'] 

In [70]:
#Same result, but selecting by integer position instead of index label
df.iloc[0] 

In [71]:
#Selecting on a multi-axis using label and attribute name
df.loc['Bulbasaur':'Venusaur',['Type 1','Type 2']]

In [72]:
#Alternatively, Selecting on a multi-axis using integers
df.iloc[0:3,1:3]

In [73]:
#Showing how to return all attributes for a multi-axis subset
df.iloc[0:3,:]
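A detail that caught me out: slices with ```.loc``` include the end label, while slices with ```.iloc``` stop before the end position. A small sketch on four hand-typed rows (illustration only):

```python
import pandas as pd

# Hand-typed sample for illustration only
sample = pd.DataFrame(
    {"HP": [45, 60, 80, 39]},
    index=["Bulbasaur", "Ivysaur", "Venusaur", "Charmander"],
)

by_label = sample.loc["Bulbasaur":"Venusaur"]  # the end label is included
by_position = sample.iloc[0:2]                 # the end position is excluded

print(len(by_label), len(by_position))
```

So the label slice returns three rows here, and the positional slice returns two.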

# **3.2 Filtering and Min/Max values**<a class="anchor" id="filteringminmax"></a>

Now we get to the more interesting stuff. It's time to get to know our data more deeply and better understand the inherent patterns within the data, so that you have a sound understanding of what's at hand. In a full project, this is where we would start to build models, perform data mining, text analysis and much more.

This will help you choose which attributes or categories are most important to your task, or develop an appropriate predictive model. We'll then determine whether there is sufficient data to move forward with the following steps. Perhaps you'd need to find new data sources with more up-to-date data, or data to augment the data set; perhaps you'd want to bring in information on the Pokémon regions, trainers or some other relevant information. This process is often iterative.

To understand our Pokémon, let's run some basic Pandas filtering, min and max functions.

In [74]:
#Return data by filtering on a column's integer value
df[df.HP > 150]


In [75]:
#Give us a distinct list of Type 1 values
df['Type 1'].unique()

In [76]:
#Show us the results where we've filtered for a specific value in an attribute
df[df['Type 1']=='Dragon'].head(5) 

In [77]:
#Return data by filtering on multiple columns
df[((df['Type 1']=='Fire')  & (df['Type 2']=='Dragon'))]

In [78]:
#Alternatively via boolean indexing we can use isin() method to filter 
df[df['Type 1'].isin(['Dragon'])].head(n=5)
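All of the filters above work the same way: the comparison produces a boolean series with one True or False per row, and the dataframe keeps the rows marked True. Here is a minimal sketch on a hand-typed sample (illustration only) showing how such masks can be combined.

```python
import pandas as pd

# Hand-typed sample for illustration only
sample = pd.DataFrame(
    {"Type 1": ["Grass", "Fire", "Fire", "Water"], "HP": [45, 39, 78, 44]},
    index=["Bulbasaur", "Charmander", "Charizard", "Squirtle"],
)

is_fire = sample["Type 1"] == "Fire"  # a boolean Series, one value per row
high_hp = sample["HP"] > 44

both = sample[is_fire & high_hp]      # rows meeting both conditions
either = sample[is_fire | high_hp]    # rows meeting at least one condition
not_fire = sample[~is_fire]           # rows where the condition is False
grass_or_water = sample[sample["Type 1"].isin(["Grass", "Water"])]

print(both.index.tolist())
print(either.index.tolist())
```

Remember to wrap each condition in parentheses if you write them inline, as ```&``` and ```|``` bind more tightly than the comparisons.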

In [79]:
#Return the index label of the row with the highest Defense
df['Defense'].idxmax()

In [80]:
#Return the index label of the row with the lowest Attack
df['Attack'].idxmin()
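Because ```idxmax``` and ```idxmin``` return an index label rather than the value itself, you can feed the label straight back into ```.loc``` to see the whole row. A small sketch on hand-typed values (illustration only):

```python
import pandas as pd

# Hand-typed sample for illustration only
sample = pd.DataFrame(
    {"Attack": [49, 52, 48], "Defense": [49, 43, 65]},
    index=["Bulbasaur", "Charmander", "Squirtle"],
)

strongest = sample["Attack"].idxmax()  # label of the row with the highest Attack
print(strongest, sample["Attack"].max())
print(sample.loc[strongest])           # the full row for that label
```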

# **3.3 Statistics**<a class="anchor" id="statistics"></a>

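Using Pandas, you can quickly get some basic statistics on the whole dataframe or on individual columns.

The following functions return statistics for the columns of the dataframe. On recent Pandas versions (2.0 onwards), ```mean```, ```median```, ```std``` and ```corr``` raise an error when the dataframe also contains text columns, as ours does, so pass ```numeric_only=True``` in that case.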

* ```df.mean()``` Returns the mean value
* ```df.corr() ``` Returns the correlation between columns
* ```df.count() ``` Returns the counts of non-null values
* ```df.max() ``` Returns the highest value
* ```df.min() ``` Returns the lowest value
* ```df.median() ``` Returns the median value
* ```df.std() ``` Returns the standard deviation

To apply a function to only one column, append it to one of the column selections you learned above, for example ```df[column].mean()```.
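As a minimal sketch of these functions on a dataframe that mixes text and numbers, here is a hand-typed sample (illustration only). I pass ```numeric_only=True``` so that the text column is skipped, and I use ```select_dtypes``` as one possible way to keep only the numeric columns before calling ```corr```.

```python
import pandas as pd

# Hand-typed sample for illustration only
sample = pd.DataFrame(
    {"Type 1": ["Grass", "Fire", "Water"], "HP": [45, 39, 44], "Attack": [49, 52, 48]},
    index=["Bulbasaur", "Charmander", "Squirtle"],
)

column_means = sample.mean(numeric_only=True)  # skips the text column
hp_mean = sample["HP"].mean()                  # a single column
correlations = sample.select_dtypes(include="number").corr()

print(column_means)
print(hp_mean)
print(correlations)
```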

In [81]:
#Highest value in each numeric column (numeric_only=True skips the text columns)
df.max(numeric_only=True)

In [82]:
#Take a look at the averages of each numeric attribute (numeric_only=True skips the text columns)
df.mean(axis=0, numeric_only=True)

In [83]:
#Get mean value of a specific column
df['HP'].mean()

In [84]:
#Histogramming, getting the counts of values in our data. Now we can start to see the distribution of Pokémon over their types
df['Type 1'].value_counts()
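If you would rather see proportions than raw counts, ```value_counts``` also takes a ```normalize``` parameter. A small sketch on a hand-typed series (illustration only):

```python
import pandas as pd

# Hand-typed sample for illustration only
types = pd.Series(["Water", "Fire", "Water", "Grass", "Water", "Fire"], name="Type 1")

counts = types.value_counts()                # most frequent value first
shares = types.value_counts(normalize=True)  # proportions that sum to 1

print(counts)
print(shares)
```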

In [85]:
#Filter our dataframe for only Legendary Pokémon, and get the mean Attack as per our stakeholders' request
df[(df['Legendary']==True)].Attack.mean()

# **3.4 Super Useful Statistics Function**<a class="anchor" id="describe"></a>

This is my new favourite function: ```.describe()``` shows a brief statistical summary of the numeric columns of the data frame.

In [86]:
#Show a brief statistical summary of the data frame
df.describe()
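By default the summary only covers numeric columns when the dataframe mixes types. As a small sketch on hand-typed values (illustration only), passing ```include="all"``` adds the text columns too, with rows such as ```unique```, ```top``` and ```freq```.

```python
import pandas as pd

# Hand-typed sample for illustration only
sample = pd.DataFrame(
    {"Type 1": ["Fire", "Fire", "Water"], "HP": [39, 78, 44]},
    index=["Charmander", "Charizard", "Squirtle"],
)

summary = sample.describe()            # numeric columns only
full = sample.describe(include="all")  # text columns as well

print(summary)
print(full)
```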

# **4. Cleanse the Data**<a class="anchor" id="cleansethedata"></a>

We've now familiarised ourselves with the dataset and some of the basic functions of Pandas. It looks like the data needs tidying up a little, so it's time to start cleaning. In reality, our datasets are normally quite dirty and messy, so this step is vitally important in data analysis.

This step depends entirely on the content of the data and will require some work on your part. For example, we might want to check for null values with ```df.isnull()```, or use ```df.isnull().sum()``` to get a summary. To deal with them, you might want to drop the affected rows (```df.dropna()```), populate a value to replace the nulls (```df.fillna(x)```), or fill in the blanks with the mean value (```df.fillna(df.mean(numeric_only=True))```).
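Here is a minimal sketch of those checks and fills on a hand-typed sample. The gaps, including the missing HP, are invented purely for illustration, and the ```"Unknown"``` placeholder is my own choice rather than anything the dataset prescribes.

```python
import pandas as pd

# Hand-typed sample with invented gaps, for illustration only
sample = pd.DataFrame(
    {"Type 2": ["Poison", None, "Flying"], "HP": [45.0, None, 78.0]},
    index=["Bulbasaur", "Charmander", "Charizard"],
)

missing_per_column = sample.isnull().sum()  # how many nulls in each column
print(missing_per_column)

filled = sample.copy()
filled["Type 2"] = filled["Type 2"].fillna("Unknown")    # placeholder for text
filled["HP"] = filled["HP"].fillna(filled["HP"].mean())  # column mean for numbers
print(filled)
```

Whether a placeholder or a mean is appropriate depends on your analysis, so treat this as one option rather than a rule.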

I find that carrying out cleansing in stages can be helpful if we ever need to revert to an earlier version of the dataset. I will demonstrate how to do this in Pandas. However, with our Pokémon dataset, we do not really need it.


In [90]:
#Before we manipulate, it might be prudent to create a copy of our data so we can return to it if needed
df2 = df.copy()
#drop the '#' column from the copy with axis=1; by the way, axis=0 is for rows
df2 = df2.drop(['#'], axis=1)

#Showing that # Column is now dropped
df2.head(n=5)         

In [91]:
#pandas primarily uses the np.nan to represent missing data. It is by default not included in computations
#Get the boolean mask where values are nan            
pd.isna(df2).head(10)

In [92]:
#Again, take a copy. This time we genuinely will revert, as the next step is just a demonstration
df3 = df2.copy()

#Drop any rows containing NaN values. However, sometimes you may not want to do this as it might start to affect your analysis
#dropna() returns a new dataframe, so assign the result to keep it
df3 = df3.dropna(how='any')
pd.isna(df3).head(10)
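One detail worth knowing here is that ```dropna``` returns a new dataframe and leaves the one you called it on untouched, so the result has to be assigned if you want to keep it. A small sketch on hand-typed rows (illustration only):

```python
import pandas as pd

# Hand-typed sample for illustration only
sample = pd.DataFrame(
    {"Type 2": ["Poison", None, "Flying"]},
    index=["Bulbasaur", "Charmander", "Charizard"],
)

sample.dropna(how="any")            # the result is discarded, sample is unchanged
print(len(sample))

cleaned = sample.dropna(how="any")  # assign the result to keep it
print(len(cleaned))
```

This also means the original stays available if you change your mind, which fits nicely with cleansing in stages.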

In [93]:
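# The new index of Names contains erroneous text. We want to remove all the text before "Mega"
# regex=True is needed on recent pandas versions, where str.replace treats the pattern as a literal string by default
df2.index = df2.index.str.replace(r".*(?=Mega)", "", regex=True)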

df2.head(5)
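To see what that pattern does, here is a minimal sketch on three hand-typed names (illustration only). The pattern ```.*(?=Mega)``` matches everything up to, but not including, the word "Mega", so names without "Mega" are left alone.

```python
import pandas as pd

# Hand-typed names for illustration only
names = pd.Index(["Venusaur", "VenusaurMega Venusaur", "CharizardMega Charizard X"])

# (?=Mega) is a lookahead: the match stops just before "Mega" without consuming it
cleaned = names.str.replace(r".*(?=Mega)", "", regex=True)

print(cleaned.tolist())
```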

# **5. Conclusion**<a class="anchor" id="conclusion"></a>

And there we have it: some basic Pandas commands that I've read about in the documentation and relevant articles. Having worked through them, I feel like I've learned more about how to use Python for data analysis. This should give us a good basis for picking up a new dataset and getting started by playing around with the data a little for exploration, understanding and manipulation.

Thank you for reading. Any **feedback** is most welcome, and if you found this useful, please **Upvote**. Comments, suggestions or questions are all appreciated!

# **6. References**<a class="anchor" id="references"></a>
* [Confident Data Skills, Kirill Eremenko (2018)](https://www.amazon.co.uk/Confident-Data-Skills-Fundamentals-Supercharge/dp/0749481544)
* [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/index.html)