# Moving from Excel to Jupyter with Python Pandas

<p style="text-align: center;">
  <img width="200" alt="COE Image" src="https://pandas.pydata.org/static/img/pandas_white.svg">
</p>
<hr style="height:10px;border-width:0;color:gray;background-color:gray">

[Home](https://pandas.pydata.org/)
[Install](https://pandas.pydata.org/getting_started.html)
[Getting Started](https://pandas.pydata.org/docs/getting_started/index.html)
[Users Guide](https://pandas.pydata.org/docs/user_guide/index.html)
[API](https://pandas.pydata.org/docs/reference/index.html)
[About](https://pandas.pydata.org/about/index.html)
[Ecosystem](https://pandas.pydata.org/community/ecosystem.html)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

Let's talk about the core concepts of [Pandas](https://pandas.pydata.org/).



Pandas is one of the best tools for real-world data analysis with Python in a Jupyter Notebook. It allows us to:

1. clean data, 
2. wrangle data, 
3. make visualizations, 
4. and more.

Pandas is really a supercharged Microsoft Excel. Most of the tasks you do in Excel can also be done in Pandas. 

I'll briefly introduce you to how Pandas outperforms Excel.

In this introduction to Pandas, I'll:
1. compare Pandas dataframes and Excel spreadsheets,
2. show different ways to create a dataframe,
3. show how to make pivot tables.

Note: Before learning Pandas, you should know at least the basics of Python.

If you are new to Python, take a look at this [guide](https://towardsdatascience.com/python-core-concepts-for-data-science-explained-with-code-decfff497674) to get started with Python.

## Why Excel Users Should Learn Python/Pandas

Tasks such as data cleaning, data normalization, visualization, and statistical analysis can be performed in both Excel and Pandas. That said, Pandas has some major benefits over Excel.

1. Limitation by size: An Excel worksheet is capped at 1,048,576 rows, while Python can handle many millions of rows (the practical limit is your computer's memory and compute power).
2. Complex data transformation: Memory-intensive computations in Excel can crash a workbook. Python can handle complex computations without any major problem. I can't count (well I can) how many times I've crashed an Excel worksheet.
3. Automation: Excel was not designed to automate tasks. You can create a macro or use VBA to simplify some tasks, but that is about the limit. Python can go beyond that with its large ecosystem of free libraries.
4. Cross-platform capabilities: In Excel, you might find some incompatibilities between formulas on Windows and macOS. This also happens when sharing Excel files with people who do not have English as the default language in their version of Microsoft Excel. In contrast, Python code remains the same regardless of the operating system or language set on a computer.



## Pandas DataFrames & Excel Spreadsheets

The two main data structures in Pandas are the series and the dataframe.

A series is a 1-dimensional labeled array, while a dataframe is a 2-dimensional labeled array.

In Pandas, we mainly work with dataframes. A Pandas dataframe is the equivalent of an Excel spreadsheet. 

Pandas dataframes, just like Excel spreadsheets, have 2 dimensions or axes.

A dataframe has rows and columns, and each column is known as a series.

At the top of a dataframe, you will see the names of the columns.

On the left side is the index, which labels the rows. By default, the index in Pandas starts at 0.


The intersection of a row and a column holds a data value, or simply data.

We can store different types of data, such as integers, floats, strings, and booleans.
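The structure described above can be sketched with a tiny, made-up table. The `cars` dataframe and its values are hypothetical and only serve to illustrate columns, the index, and data values.

```python
import pandas as pd

# A small, made-up table (hypothetical values, for illustration only)
cars = pd.DataFrame({
    "name": ["car a", "car b", "car c"],
    "cylinders": [8, 4, 6],
    "mpg": [18.0, 31.5, 22.0],
    "automatic": [True, False, True],
})

# Selecting a single column returns a Series
mpg_column = cars["mpg"]
print(type(mpg_column).__name__)   # Series

# The default index starts at 0
print(list(cars.index))            # [0, 1, 2]

# A data value sits at the intersection of a row label and a column name
print(cars.loc[1, "mpg"])          # 31.5

# Each column carries its own data type
print(cars.dtypes)
```

Unlike an Excel column, every Pandas column has a single data type, which is what `dtypes` reports.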

In [1]:
# I'm importing the libraries that are required for loading and visualization of the data
import pandas as pd
import seaborn as sns

In [2]:
# now that we've imported seaborn let's take a look at the built in seaborn datasets
dataset_names = sns.get_dataset_names()
dataset_names

['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'exercise',
 'flights',
 'fmri',
 'gammas',
 'geyser',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'taxis',
 'tips',
 'titanic']

In [4]:
# Let's load one of the datasets
mpg = sns.load_dataset("mpg")

In [5]:
mpg.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa | ford torino |


Above is a dataframe that shows the miles per gallon (mpg) of various vehicles.

Notice that each column has a name and each row has an index label. Each column is a series, and each intersection of a row and a column holds a single data value.

## Terminology translation between Pandas and Excel

|     | Excel | Pandas|
|:--- |:---  |---:|
|1|Worksheet  |Dataframe|
|2|Column     |Series|
|3|Row Heading|Index|
|4|Row        |Row|
|5|Empty cell |NaN|

In Pandas, missing data is represented with NaN, which stands for “Not a Number”. In Excel, the same cell simply appears empty.
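As a minimal sketch of how NaN behaves, the example below builds a small, hypothetical `scores` table with one missing value and then finds it. The names and numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing score
scores = pd.DataFrame({
    "student": ["Ana", "Ben", "Cy"],
    "score": [90.0, np.nan, 75.0],
})

# isna() marks the missing cells with True
missing_mask = scores["score"].isna()
print(missing_mask.tolist())       # [False, True, False]
print(int(missing_mask.sum()))     # 1

# Like Excel's AVERAGE with empty cells, mean() skips NaN by default
print(scores["score"].mean())      # 82.5
```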

## How to Create a Dataframe

There are multiple ways to create a dataframe. 

1. We can create a dataframe by reading an Excel/CSV file, 
2. using arrays, 
3. and also with dictionaries.

Before creating a dataframe, ensure you have Python and Pandas installed. 

To install Pandas, run the following command in a terminal or command prompt:
```bash
pip install pandas
```

If you are in a Jupyter Notebook, you can install it from a code cell by running

```bash
!pip install pandas
```
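Of the three approaches listed above, the dictionary route is the quickest to try without any files. Here is a minimal sketch, assuming a made-up `die_faces` dictionary: each key becomes a column name and each list becomes that column's values.

```python
import pandas as pd

# Hypothetical data: keys become column names, lists become column values
die_faces = {
    "face": [1, 2, 3, 4, 5, 6],
    "is_even": [False, True, False, True, False, True],
}

df_from_dict = pd.DataFrame(die_faces)

print(df_from_dict.shape)            # (6, 2)
print(list(df_from_dict.columns))    # ['face', 'is_even']
print(df_from_dict)
```

All lists must have the same length, just as the columns of a rectangular Excel range do.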

## Creating a dataframe by reading in a CSV file

This is without a doubt the easiest way to create a dataframe in Pandas. 

We only need to import pandas (we already did this above), call the read_csv() function, and pass the path of the CSV file inside the parentheses. Excel files are read with read_excel() instead.

We'll read in a CSV file that contains data about the various kernels that Jupyter Notebook can use.

In [7]:
# I previously webscraped this using pd.read_clipboard() and then used to_csv() to create this dataset
df_kernels = pd.read_csv('../../Data/kernels.csv')
df_kernels

| | Unnamed: 0 | Name | Jupyter/IPython Version | Language(s) Version | 3rd party dependencies | Example Notebooks | Notes |
|---|---|---|---|---|---|---|---|
| 0 | 0 | D-lang | Jupyter | DMD | NaN | NaN | NaN |
| 1 | 1 | Micronaut | NaN | Python>=3.7.5, Groovy>3 | Micronaut | https://github.com/stainlessai/micronaut-jupyt... | Compatible with BeakerX |
| 2 | 2 | Agda kernel | NaN | 2.6.0 | NaN | https://mybinder.org/v2/gh/lclem/agda-kernel/m... | NaN |
| 3 | 3 | Dyalog Jupyter Kernel | NaN | APL (Dyalog) | Dyalog >= 15.0 | Notebooks | Can also be run on TryAPL's Learn tab |
| 4 | 4 | Coarray-Fortran | Jupyter 4.0 | Fortran 2008/2015 | GFortran >= 7.1, OpenCoarrays, MPICH >= 3.2 | Demo, Binder demo | Docker image |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 148 | 148 | .Net Interactive | Jupyter 4 | C#, F#, Powershell | .Net Core SDK | Binder Examples | NaN |
| 149 | 149 | mariadb_kernel | Jupyter Notebook/Lab | SQL | Internal Dependencies, MariaDB Server | Binder notebook | A Jupyter kernel for the MariaDB Open Source d... |
| 150 | 150 | ISetlX | Jupyter | SetlX | NaN | Example | NaN |
| 151 | 151 | Ganymede | Jupyter >= 4.0 | Java 11+, Groovy, Javascript, Kotlin, Scala, A... | JShell, Apache Maven Resolver | Examples | NaN |
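The "Unnamed: 0" column in the output above is typically what you get when a dataframe's index was written to the CSV and then read back as ordinary data. The sketch below reproduces that round trip in memory with `io.StringIO`, so it does not need the kernels.csv file. The two-row `original` table is a stand-in, and whether this explains every unnamed column in that particular file is an assumption.

```python
import io
import pandas as pd

# A two-row stand-in for the kernels table
original = pd.DataFrame({
    "Name": ["D-lang", "ISetlX"],
    "Language": ["DMD", "SetlX"],
})

# to_csv() writes the index as an unnamed first column by default
csv_text = original.to_csv()

# Reading it back without options turns that column into "Unnamed: 0"
naive = pd.read_csv(io.StringIO(csv_text))
print(list(naive.columns))      # ['Unnamed: 0', 'Name', 'Language']

# Option 1: tell read_csv that the first column is the index
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(restored.columns))   # ['Name', 'Language']

# Option 2: do not write the index in the first place
clean = pd.read_csv(io.StringIO(original.to_csv(index=False)))
print(list(clean.columns))      # ['Name', 'Language']
```

Either option keeps the extra index columns from piling up each time a file is saved and reloaded.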


## Creating a dataframe by reading in an Excel file

In [13]:
# use read_excel 
df_effect = pd.read_excel('../../Data/Effect-Size-Worksheet.xlsx')
df_effect

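| | Group 1 | Group 2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Group 1.1 | Group 2.1 | Difference | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 15 | NaN | NaN | NaN | 3 | 5 | -2 | NaN | Mean of the Difference | -2.111111 |
| 1 | 7 | 16 | NaN | Mean Group 1 | 10.518519 | 11 | 11 | 0 | NaN | Standard Deviation of the Difference | 7.196865 |
| 2 | 4 | 11 | NaN | Standard Deviation Group 1 | 4.987449 | 2 | 6 | -4 | NaN | Cohen's D | -0.293338 |
| 3 | 2 | 1 | NaN | Mean Group 2 | 9.518519 | 6 | 8 | -2 | NaN | NaN | Medium |
| 4 | 17 | 5 | NaN | Standard Deviation Group 2 | 5.323174 | 11 | 16 | -5 | NaN | NaN | NaN |
| 5 | 16 | 11 | NaN | Pooled SD | 5.158044 | 1 | 7 | -6 | NaN | NaN | NaN |
| 6 | 7 | 16 | NaN | Cohen's D | 0.193872 | 4 | 15 | -11 | NaN | NaN | NaN |
| 7 | 12 | 4 | NaN | NaN | Small | 1 | 9 | -8 | NaN | NaN | NaN |
| 8 | 7 | 8 | NaN | NaN | NaN | 5 | 10 | -5 | NaN | NaN | NaN |
| 9 | 17 | 15 | NaN | NaN | NaN | 0 | 6 | -6 | NaN | NaN | NaN |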


## Creating a dataframe with arrays
To create a dataframe with arrays we need to import NumPy first. Let's import this library and create one array for each of two dice.

In [14]:
# Load library
import numpy as np

In [15]:
#create a vector as a row (one die)
vector_row = np.array([1,2,3,4,5,6])

In [16]:
#create a vector or series as a column (second die)
vector_column = np.array([[1],
                          [2],
                          [3],
                          [4],
                          [5],
                          [6]])

In [17]:
# let's create our two dimensional array
matrix = vector_row + vector_column
matrix

array([[ 2,  3,  4,  5,  6,  7],
       [ 3,  4,  5,  6,  7,  8],
       [ 4,  5,  6,  7,  8,  9],
       [ 5,  6,  7,  8,  9, 10],
       [ 6,  7,  8,  9, 10, 11],
       [ 7,  8,  9, 10, 11, 12]])
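The addition above works because of NumPy broadcasting: a row of shape (6,) plus a column of shape (6, 1) is stretched into a 6 x 6 grid containing every pairwise sum. A minimal sketch of the same idea follows. The variable names are mine, not from the notebook, and `np.add.outer` is shown only as an equivalent way to build the table.

```python
import numpy as np

row = np.arange(1, 7)            # one die as a row, shape (6,)
column = row.reshape(-1, 1)      # the other die as a column, shape (6, 1)
print(row.shape, column.shape)   # (6,) (6, 1)

# Broadcasting pairs every row value with every column value
sums = row + column
print(sums.shape)                # (6, 6)

# np.add.outer builds the same table of pairwise sums
same_table = np.add.outer(row, row)
print(np.array_equal(sums, same_table))   # True

# Smallest and largest possible totals
print(sums[0, 0], sums[5, 5])    # 2 12
```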

### We can start visualizing some interesting things next

The probability frequency distribution is the collection of probabilities for each possible outcome. This is how we know that 7 is the most probable sum of two dice. It is usually expressed as a graph or as a table, and the matrix above lists all 36 equally likely outcomes.

Next we list every distinct sum that appears in the matrix:

sum_table = (2,3,4,5,6,7,8,9,10,11,12)

Then we count how many times each sum appears in the matrix to create a frequency table:

freq_table = (1,2,3,4,5,6,5,4,3,2,1)

Next we compute the probability associated with each sum from 2 to 12:

prob_table = (1/36, 1/18, 1/12, 1/9, 5/36, 1/6, 5/36, 1/9, 1/12, 1/18, 1/36)

Each probability is the frequency divided by the size of the sample space, which is 36 (6 × 6) for two dice.

So the prob_table is called the __"Probability Frequency Distribution"__
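Rather than typing these tables by hand, they can be derived from the matrix itself. The sketch below rebuilds the matrix so that it runs on its own, then uses `np.unique` to count each sum. The `distribution` dataframe and its column names are my own choices for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the 6 x 6 matrix of two-dice sums
faces = np.arange(1, 7)
matrix = faces + faces.reshape(-1, 1)

# Distinct sums and how often each one occurs among the 36 outcomes
sums, counts = np.unique(matrix, return_counts=True)

# Probability = frequency / size of the sample space
distribution = pd.DataFrame({
    "sum": sums,
    "frequency": counts,
    "probability": counts / matrix.size,
})
print(distribution)

# The row with the highest probability gives the most likely sum
most_likely = distribution.loc[distribution["probability"].idxmax(), "sum"]
print(most_likely)   # 7
```

The frequency column matches freq_table above, and the probabilities add up to 1.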


In [19]:
# let's now take our array called matrix and put it into a DataFrame
df = pd.DataFrame(matrix)
df

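| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 2 | 3 | 4 | 5 | 6 | 7 |
| 1 | 3 | 4 | 5 | 6 | 7 | 8 |
| 2 | 4 | 5 | 6 | 7 | 8 | 9 |
| 3 | 5 | 6 | 7 | 8 | 9 | 10 |
| 4 | 6 | 7 | 8 | 9 | 10 | 11 |
| 5 | 7 | 8 | 9 | 10 | 11 | 12 |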


In [21]:
# notice that the index and column labels are off by one from the dice faces because both start at 0
df.columns

RangeIndex(start=0, stop=6, step=1)

In [22]:
# shift the column labels so they start at 1 instead of 0 (the index is shifted the same way further down)
df.columns += 1

In [24]:
df.columns

RangeIndex(start=1, stop=7, step=1)

In [23]:
df

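| | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 0 | 2 | 3 | 4 | 5 | 6 | 7 |
| 1 | 3 | 4 | 5 | 6 | 7 | 8 |
| 2 | 4 | 5 | 6 | 7 | 8 | 9 |
| 3 | 5 | 6 | 7 | 8 | 9 | 10 |
| 4 | 6 | 7 | 8 | 9 | 10 | 11 |
| 5 | 7 | 8 | 9 | 10 | 11 | 12 |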


In [25]:
df.index

RangeIndex(start=0, stop=6, step=1)

In [26]:
df.index += 1

In [27]:
df.index

RangeIndex(start=1, stop=7, step=1)

In [28]:
df

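| | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| 5 | 6 | 7 | 8 | 9 | 10 | 11 |
| 6 | 7 | 8 | 9 | 10 | 11 | 12 |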


## Now we have a table of two-dice sums showing the correct 1-6 values for index and columns
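The same labels can also be set at the moment the dataframe is created, by passing `index` and `columns` to `pd.DataFrame`. This is a sketch of an alternative to shifting the labels afterwards. The names `df_dice`, "die 1", and "die 2" are mine, chosen for illustration.

```python
import numpy as np
import pandas as pd

faces = np.arange(1, 7)
matrix = faces + faces.reshape(-1, 1)

# Label rows and columns with the die faces 1-6 at construction time
df_dice = pd.DataFrame(matrix, index=faces, columns=faces)

# Optional: name the two axes so the table documents itself
df_dice = df_dice.rename_axis(index="die 1", columns="die 2")
print(df_dice)

# Label-based lookup now reads naturally: a 3 and a 4 sum to 7
print(df_dice.loc[3, 4])   # 7
```

With labels that match the die faces, `df_dice.loc[a, b]` always returns `a + b`.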