<a href="https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html">Pandas official community tutorials</a>

<a href="https://www.dataschool.io/easier-data-analysis-with-pandas/">Best Pandas video series tutorial by DataSchool</a>

<a href="https://stackoverflow.com/questions/tagged/pandas">Pandas community Stack Overflow</a>

## 1. What is Pandas?

#### Pandas is a fast, powerful, flexible, and easy-to-use open source data analysis and manipulation tool for Python, built on top of NumPy.

## 2. Features of Pandas.

<img src="pandas_features.png" alt="Pandas Features (Wikipedia)" width="900" height="650">

There are many things to like about pandas: it is well documented, has a large amount of <a href="https://stackoverflow.com/questions/tagged/pandas">community support</a>, is under active development, and plays well with other Python libraries (such as matplotlib, scikit-learn, and seaborn).

## 3. What kind of data does Pandas handle?

##### Pandas provides two primary data structures:

<ul>
    <li>Series</li>
    <li>DataFrame</li>
</ul>

#### (a) What is a Series?

<b>Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).</b>

```python
pd.Series(data, index=index)
```

<b>Here, data can be many different things:</b>
    
<ul>
        <li>A Python Dictionary</li>
        <li>An ndarray</li>
        <li>A scalar value (like 9)</li>
</ul>
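The three kinds of input listed above can be sketched as follows. The variable names (`s_dict`, `s_arr`, `s_scalar`) and the sample values are illustrative only.

```python
import numpy as np
import pandas as pd

# From a dict: the keys become the index labels.
s_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# From an ndarray: a passed index must have the same length as the data.
s_arr = pd.Series(np.array([10.0, 20.0, 30.0]), index=["x", "y", "z"])

# From a scalar: the value is repeated once for every index label.
s_scalar = pd.Series(9, index=["p", "q", "r"])

print(s_dict)
print(s_arr)
print(s_scalar)
```

When the data is a scalar, the index is what determines the length of the resulting Series.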

#### (b) What is a DataFrame?

##### DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

<ul>
    <li>Dict of 1D ndarrays, lists, dicts, or Series</li>
    <li>2-D numpy.ndarray</li>
    <li>A Series</li>
    <li>Another DataFrame</li>
</ul>

```python
pd.DataFrame(data, index, columns)
```

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.
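A minimal sketch of the index and columns behavior described above, using a small made-up dict of Series (the names `data`, `df_all`, `df_sub`, and `df_cols` are illustrative):

```python
import pandas as pd

data = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}

# No index passed: the result uses the union of the Series indexes,
# and "one" gets NaN for the label "d" it does not contain.
df_all = pd.DataFrame(data)

# Index passed: only the requested labels are kept, so row "c" is discarded.
df_sub = pd.DataFrame(data, index=["d", "b", "a"])

# Columns passed: "one" is dropped, and the unknown column "three" is all NaN.
df_cols = pd.DataFrame(data, index=["d", "b", "a"], columns=["two", "three"])

print(df_all)
print(df_sub)
print(df_cols)
```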

<img src='df.png' alt="DataFrame representation" width=600 height=700>

## 4. Importing Pandas

In [1]:
import pandas as pd

## 5. Essential Basic Functionality - I

In [2]:
import numpy as np                         # needed in some parts

### (a) Object Creation - Series, DataFrames

In [3]:
s1 = pd.Series([1, 5, 7, np.nan, 23, 77])
s1

0     1.0
1     5.0
2     7.0
3     NaN
4    23.0
5    77.0
dtype: float64

##### Pandas objects have a number of attributes enabling you to access the metadata.

<ul>
    <li><b>shape:</b> Gives the axis dimensions of the object, consistent with NumPy ndarrays.</li>
    <li><b>Axis Labels:</b>
        <ul>
            <li>Series: index (the only axis).</li>
            <li>DataFrame: index (rows) and columns.</li>
        </ul>
    </li>
</ul>
An example is demonstrated below:

In [4]:
df1 = pd.DataFrame(np.array([[1, -2, 30, 0], [41, 15, 66, -5], [7, 8, 9, -10]]),
                   columns=['p', 'q', 'r', 's'], index=['x', 'y', 'z'])
df1

    p   q   r   s
x   1  -2  30   0
y  41  15  66  -5
z   7   8   9 -10
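The metadata attributes listed earlier (shape, index, columns) can be read directly from the objects. The sketch below rebuilds the same `s1` and `df1` so that it runs on its own.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 5, 7, np.nan, 23, 77])
df1 = pd.DataFrame(np.array([[1, -2, 30, 0], [41, 15, 66, -5], [7, 8, 9, -10]]),
                   columns=['p', 'q', 'r', 's'], index=['x', 'y', 'z'])

print(s1.shape)      # a Series has a single axis
print(s1.index)      # its row labels
print(df1.shape)     # (number of rows, number of columns)
print(df1.index)     # row labels
print(df1.columns)   # column labels
```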


##### Note: df.to_numpy() gives a NumPy representation of the DataFrame.

In [5]:
df1.to_numpy()

array([[  1,  -2,  30,   0],
       [ 41,  15,  66,  -5],
       [  7,   8,   9, -10]])

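#### Earlier, df.values (an attribute, not a method) was used to achieve the same result as above, but going forward it's recommended to use:
<pre>
i) .array       (a property of Series and Index; returns an ExtensionArray)
ii) .to_numpy() (a method of Series, Index, and DataFrame; returns a NumPy ndarray)
</pre>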
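As a rough illustration of the difference between the two accessors, the sketch below checks the type each one returns. The `mixed` DataFrame is a made-up example showing that columns of different types fall back to a common dtype.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, 2.5, 3.5])

ext = s.array        # property: a pandas ExtensionArray wrapping the data
arr = s.to_numpy()   # method: a plain NumPy ndarray

print(type(ext))
print(type(arr))

# A DataFrame with mixed column types has to find one common dtype,
# which here is the generic object dtype.
mixed = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
print(mixed.to_numpy().dtype)
```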

<hr>

### (b) Boolean Reductions

You can use the following reductions to summarize a boolean result. Note that bool() is deprecated since pandas 2.1.
```python
empty, any(), all(), and bool()
```


In [6]:
df1

    p   q   r   s
x   1  -2  30   0
y  41  15  66  -5
z   7   8   9 -10


In [7]:
(df1>0).any()           # Every value in column s is <= 0, so s is False.

p     True
q     True
r     True
s    False
dtype: bool

In [8]:
(df1>0).all()

p     True
q    False
r     True
s    False
dtype: bool

##### We can reduce to a final boolean value.

In [9]:
(df1>0).all().any()

True
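The reductions above work column by column. As a small sketch, the same methods also accept an axis argument, so the test can run per row or over the whole frame at once (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([[1, -2, 30, 0], [41, 15, 66, -5], [7, 8, 9, -10]]),
                   columns=['p', 'q', 'r', 's'], index=['x', 'y', 'z'])

positive = df1 > 0

row_any = positive.any(axis=1)         # does each row contain a positive value?
row_all = positive.all(axis=1)         # is every value in each row positive?
anywhere = positive.any(axis=None)     # single boolean over all elements

print(row_any)
print(row_all)
print(anywhere)
```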

In [10]:
df1.empty

False

<u>Notice that NaN does not compare equal to itself</u>

In [11]:
np.nan == np.nan

False
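Because of this, an element-wise `==` followed by `all()` reports two objects as different whenever they contain NaN in the same place. A minimal sketch with two made-up Series, using `equals()` and `isna()` instead:

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([1.0, np.nan, 3.0])

elementwise = a == b             # the NaN position compares as False
all_equal = elementwise.all()    # so this reduction is False

same = a.equals(b)               # equals() treats NaNs in the same location as equal
missing = a.isna()               # the reliable way to locate NaN values

print(elementwise)
print(all_equal, same)
print(missing)
```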

<hr>

### (c) Descriptive Statistics

A large number of methods exist for computing descriptive statistics and other related operations on Series and DataFrame. Most of these are aggregations (hence producing a lower-dimensional result) like sum(), mean(), and quantile(), but some of them, like cumsum() and cumprod(), produce an object of the same size. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, …}, but the axis can be specified by name or integer:

<pre>
i) <b>Series:</b> no axis argument necessary
ii) <b>DataFrame:</b> "index" (axis=0, the default), "columns" (axis=1)
</pre>
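A short sketch of both points above: the axis can be given by name, and `cumsum()` keeps the shape of its input while `sum()` reduces it. The frame is the same `df1` used throughout, rebuilt here so the block runs on its own.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([[1, -2, 30, 0], [41, 15, 66, -5], [7, 8, 9, -10]]),
                   columns=['p', 'q', 'r', 's'], index=['x', 'y', 'z'])

col_sum = df1.sum(axis="index")     # same as axis=0: one value per column
row_sum = df1.sum(axis="columns")   # same as axis=1: one value per row
running = df1.cumsum()              # running total down each column, same shape as df1

print(col_sum)
print(row_sum)
print(running)
```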

In [12]:
df1

    p   q   r   s
x   1  -2  30   0
y  41  15  66  -5
z   7   8   9 -10


In [13]:
df1.sum()

p     49
q     21
r    105
s    -15
dtype: int64

In [14]:
df1.sum(1)              # axis=1

x     29
y    117
z     14
dtype: int64

##### <u>Note</u>: Descriptive statistics methods skip NaN values by default (the skipna parameter defaults to True); pass skipna=False to include them.
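The effect of skipna can be seen on the Series `s1` created earlier, which contains one NaN. It is rebuilt here so the sketch is self-contained.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 5, 7, np.nan, 23, 77])

total = s1.sum()                     # NaN is ignored by default
total_strict = s1.sum(skipna=False)  # NaN propagates into the result
mean = s1.mean()                     # averaged over the five non-missing values

print(total, total_strict, mean)
```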

Combined with the broadcasting / arithmetic behavior, one can describe various statistical procedures, like standardization (rendering data zero mean and standard deviation of 1), very concisely:

In [15]:
df1.mean()

p    16.333333
q     7.000000
r    35.000000
s    -5.000000
dtype: float64

In [16]:
df2 = (df1 - df1.mean())/df1.std()
df2

          p         q         r    s
x -0.710812 -1.053370 -0.173448  1.0
y  1.143480  0.936329  1.075378  0.0
z -0.432668  0.117041 -0.901930 -1.0
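As a quick check of the standardization, each column of the result should have a mean of about 0 and a standard deviation of 1. The sketch below also shows one way to standardize row-wise, where `sub()` and `div()` with `axis=0` are used so that the per-row statistics align with the row labels.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([[1, -2, 30, 0], [41, 15, 66, -5], [7, 8, 9, -10]]),
                   columns=['p', 'q', 'r', 's'], index=['x', 'y', 'z'])

# Column-wise: the column statistics broadcast across the rows automatically.
df2 = (df1 - df1.mean()) / df1.std()
print(df2.mean())   # approximately 0 for every column
print(df2.std())    # 1 for every column

# Row-wise: align the per-row mean and std on the index (axis=0).
df_rows = df1.sub(df1.mean(axis=1), axis=0).div(df1.std(axis=1), axis=0)
print(df_rows.std(axis=1))
```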


##### Some more descriptive statistics functions:

<img src="ds_func.png" alt="Pandas DS functions" width=300 height=150>

In [17]:
df1.skew()

p    1.582523
q   -0.519470
r    0.757035
s    0.000000
dtype: float64

#### Summarizing data: describe

In [18]:
df1.describe()

               p          q          r     s
count   3.000000   3.000000   3.000000   3.0
mean   16.333333   7.000000  35.000000  -5.0
std    21.571586   8.544004  28.827071   5.0
min     1.000000  -2.000000   9.000000 -10.0
25%     4.000000   3.000000  19.500000  -7.5
50%     7.000000   8.000000  30.000000  -5.0
75%    24.000000  11.500000  48.000000  -2.5
max    41.000000  15.000000  66.000000   0.0


In [19]:
df1.p.describe()

count     3.000000
mean     16.333333
std      21.571586
min       1.000000
25%       4.000000
50%       7.000000
75%      24.000000
max      41.000000
Name: p, dtype: float64

### (d) Value Counts (Histogramming)

The value_counts() Series method computes a histogram of a 1D array of values. A top-level pd.value_counts() function also exists for regular arrays, but it is deprecated since pandas 2.1; wrapping the array in a Series, as below, works in all versions:

In [20]:
data = np.random.randint(0, 9, size=70)
df3 = pd.Series(data=data)
df3.head()

0    4
1    8
2    8
3    1
4    8
dtype: int32

In [21]:
df3.value_counts()

8    13
5    11
1    10
6     9
7     8
4     7
3     5
0     4
2     3
dtype: int64
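Since the random data above changes on every run, here is a small deterministic sketch with made-up values. It also shows the `normalize` option, which returns proportions instead of counts, and `mode()`, which returns the most frequent value(s).

```python
import pandas as pd

s = pd.Series([1, 1, 3, 3, 3, 5, 7])

counts = s.value_counts()                 # sorted by count, most frequent first
shares = s.value_counts(normalize=True)   # relative frequencies that sum to 1
most_common = s.mode()                    # a Series, because ties are possible

print(counts)
print(shares)
print(most_common)
```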

### (e) Discretization and quantiling

Continuous values can be discretized using the <b>cut()</b> (bins based on values) and <b>qcut()</b> (bins based on sample quantiles) functions.

In [22]:
arr1 = np.random.uniform(0, 10, 50)
arr1

array([5.17538357, 8.14428805, 9.9962756 , 0.55305944, 4.88880761,
       1.42499739, 6.84171797, 2.74651899, 1.68054911, 6.39603468,
       2.69736882, 2.94185812, 0.80335143, 3.38662762, 0.64211494,
       3.62846163, 8.76264264, 1.02770565, 9.4045288 , 1.6339376 ,
       2.89244431, 5.4732346 , 5.04116147, 4.27401359, 6.99971177,
       5.92832892, 9.37997223, 2.08633634, 7.57831967, 2.79381578,
       9.66108   , 9.63004171, 5.89835699, 1.52362602, 3.15385971,
       1.64042774, 8.89569393, 9.10952997, 1.33594382, 4.13452733,
       5.4494158 , 8.75168836, 2.48723568, 7.00076682, 4.76634114,
       2.96145941, 8.31802864, 7.45255972, 7.8003416 , 9.98738527])

In [23]:
pd.cut(arr1, 5)                      # Divides the data into five equal-width intervals

[(4.33, 6.219], (8.108, 9.996], (8.108, 9.996], (0.544, 2.442], (4.33, 6.219], ..., (2.442, 4.33], (8.108, 9.996], (6.219, 8.108], (6.219, 8.108], (8.108, 9.996]]
Length: 50
Categories (5, interval[float64]): [(0.544, 2.442] < (2.442, 4.33] < (4.33, 6.219] < (6.219, 8.108] < (8.108, 9.996]]

The bin edges can also include infinite values.

In [24]:
pd.cut(arr1, [-np.inf, 0, np.inf])

[(0.0, inf], (0.0, inf], (0.0, inf], (0.0, inf], (0.0, inf], ..., (0.0, inf], (0.0, inf], (0.0, inf], (0.0, inf], (0.0, inf]]
Length: 50
Categories (2, interval[float64]): [(-inf, 0.0] < (0.0, inf]]
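Explicit bin edges become more useful when each bin gets a name. The sketch below uses made-up ages and hypothetical labels; by default the intervals are closed on the right, so 18 would fall into the first bin.

```python
import numpy as np
import pandas as pd

ages = np.array([3, 17, 25, 42, 68, 90])

# Bins: (0, 18], (18, 65], (65, inf]
groups = pd.cut(ages, bins=[0, 18, 65, np.inf], labels=["child", "adult", "senior"])

print(groups)
print(pd.Series(groups).value_counts())
```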

In [25]:
factor = pd.qcut(arr1, [0, 0.25, 0.5, 0.75, 1])    # Splits the data at the 25%, 50%, and 75% quantiles
factor

[(4.965, 7.745], (7.745, 9.996], (7.745, 9.996], (0.552, 2.71], (2.71, 4.965], ..., (2.71, 4.965], (7.745, 9.996], (4.965, 7.745], (7.745, 9.996], (7.745, 9.996]]
Length: 50
Categories (4, interval[float64]): [(0.552, 2.71] < (2.71, 4.965] < (4.965, 7.745] < (7.745, 9.996]]
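The practical difference from cut() is that qcut() aims for bins holding roughly the same number of observations. A deterministic sketch on the integers 0 to 99, where `labels=False` is used to get integer bin codes instead of intervals:

```python
import numpy as np
import pandas as pd

values = np.arange(100)

# Passing an integer asks for that many quantile-based bins.
codes = pd.qcut(values, 4, labels=False)

# Each quartile bin receives the same number of values.
print(np.bincount(codes))
```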

<hr>