# Basics on Pandas

* In the previous notebook, we looked in detail at NumPy and its ``ndarray`` object, which provides efficient storage and manipulation of dense typed arrays in Python.
* In this notebook, we'll build on that knowledge by looking in detail at the data structures provided by the Pandas library.
* Pandas is a newer package built on top of NumPy that provides an efficient implementation of a ``DataFrame``.
* A ``DataFrame`` is essentially a multidimensional array with attached row and column labels, often with heterogeneous types and/or missing data.

In [1]:
import pandas as pd
pd.__version__

'1.3.4'

Let's introduce the three fundamental Pandas data structures: the ``Series``, ``DataFrame``, and ``Index``.

---
We will start our code sessions with the standard NumPy and Pandas imports:

In [2]:
import numpy as np
import pandas as pd

## The Pandas Series Object

A Pandas ``Series`` is a one-dimensional array of indexed data.


### Constructing Series objects

```
>>> pd.Series(data, index=index)
```

where ``index`` is an optional argument, and ``data`` can be one of many entities.

For example, ``data`` can be a list or NumPy array, in which case ``index`` defaults to an integer sequence:

In [3]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

As the output shows, a ``Series`` holds both a sequence of values and a sequence of indices, which we can access with the ``values`` and ``index`` attributes.
* The ``values`` are simply a NumPy array:

In [4]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

The ``index`` is an array-like object of type ``pd.Index``.

In [5]:
data.index

RangeIndex(start=0, stop=4, step=1)
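
To make the relationship between the two attributes concrete, here is a minimal sketch that pairs each index label with its value. The variable names (`s`, `vals`, `idx`, `pairs`) are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0])

# .values exposes the data as a NumPy array
vals = s.values
print(type(vals))

# .index holds the labels; with no explicit index it is a RangeIndex
idx = s.index
print(type(idx))

# Each label lines up with exactly one value
pairs = list(zip(idx, vals))
print(pairs)
```
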

As with a NumPy array, data can be accessed by the associated index using Python's square-bracket notation:

In [6]:
data[1]

0.5

In [7]:
data[:3]

0    0.25
1    0.50
2    0.75
dtype: float64

This explicit index definition gives the ``Series`` object additional capabilities: the index need not be an integer, but can consist of values of any desired type.
For example, if we wish, we can use strings as an index:

In [8]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [9]:
data['b']  # access a value by its index label

0.5

In [10]:
#We can even use non-contiguous or non-sequential indices:

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64
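
With integer labels like these, a label and a position can look alike. The sketch below uses the `.loc` (label-based) and `.iloc` (position-based) indexers to keep the two apart. The variable names are illustrative, not part of the original notebook.

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

# .loc looks up by label: the entry whose label is 5
by_label = s.loc[5]

# .iloc looks up by position: the first entry, whatever its label
by_position = s.iloc[0]

print(by_label, by_position)
```

Here `s.loc[5]` returns `0.5`, while `s.iloc[0]` returns `0.25`, the value stored under label `2`.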

In [11]:
# We can create a Series from a dictionary

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [12]:
population['California']

38332521

In [13]:
population['California':'Illinois']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
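
Note that slicing by label includes the end label, unlike ordinary positional slicing in Python. A short sketch of the difference, rebuilding the same `population` data so the block runs on its own:

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Label-based slice: both 'Texas' and 'Florida' are included
label_slice = population['Texas':'Florida']

# Position-based slice: position 3 ('Florida') is excluded
position_slice = population.iloc[1:3]

print(label_slice)
print(position_slice)
```
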

##  Pandas `DataFrame`
A ``DataFrame`` is a two-dimensional array with both flexible row indices and flexible column names.

The ``DataFrame()`` constructor is used to create a DataFrame in Pandas. The basic syntax is:

``pandas.DataFrame(data, index, columns)``
where

* ``data:`` The data from which the DataFrame is created. It can be a list, dictionary, scalar value, Series, ndarray, etc.

* ``index:`` Optional row labels. By default, the rows are labeled with integers from 0 to n-1, where n is the number of rows.

* ``columns:`` Optional column names. If no column names are given, the columns are labeled with integers from 0 to n-1, where n is the number of columns.
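
As a quick illustration of these defaults, the sketch below builds the same small array into a DataFrame twice, once with no labels and once with explicit ones. The label names (`r0`, `c0`, and so on) are made up for the example.

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(3, 2)  # [[0, 1], [2, 3], [4, 5]]

# No index or columns given: both default to 0..n-1
df_default = pd.DataFrame(arr)

# Explicit row labels and column names
df_labeled = pd.DataFrame(arr,
                          index=['r0', 'r1', 'r2'],
                          columns=['c0', 'c1'])

print(df_default)
print(df_labeled)
```
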

In [14]:
matrix_data = np.random.randint(1,20,size=20).reshape(5,4)
row_labels = ['A','B','C','D','E']
column_headings = ['W','X','Y','Z']

df = pd.DataFrame(data=matrix_data, index=row_labels, columns=column_headings)
print("\nThe data frame looks like\n",'-'*45, sep='')
print(df)


The data frame looks like
---------------------------------------------
    W   X   Y   Z
A  10  13  19  19
B  14  18  17  12
C  16  15  10   3
D  18  14  15   1
E   6  18  17  10


In [15]:
df.index

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

In [16]:
df.columns

Index(['W', 'X', 'Y', 'Z'], dtype='object')

In [17]:
df['W']

A    10
B    14
C    16
D    18
E     6
Name: W, dtype: int32

A ``DataFrame`` can also be built from a list of dictionaries. If some keys are missing from a dictionary, Pandas fills the corresponding entries with ``NaN`` (i.e., "not a number") values:

In [18]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
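
One way to check where the gaps landed is to count missing values per column with `isna()`. This is a small sketch on the same two records; the names `records`, `df_records`, and `missing_per_column` are illustrative.

```python
import pandas as pd

records = [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]
df_records = pd.DataFrame(records)

# isna() marks missing entries; summing counts them per column
missing_per_column = df_records.isna().sum()
print(missing_per_column)

# Columns that received a NaN are stored as floats, since NaN is a float
print(df_records.dtypes)
```
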


In [19]:
# Create a DataFrame from a dict of lists.

# Initialise the data as a dict of lists.
data = {'Name': ['aziz', 'abdul', 'krish', 'jack'],
        'Age': [20, 21, 19, 18]}

# Create the DataFrame with explicit row labels.
df = pd.DataFrame(data, index=[1, 2, 3, 4])

# Print the output.
print(df)

    Name  Age
1   aziz   20
2  abdul   21
3  krish   19
4   jack   18


``Column Selection:`` To select columns in a Pandas DataFrame,
we can access them by their column names. Passing a list of names
selects several columns at once.


In [20]:

# Define a dictionary containing employee data
data = {'Name':['Abdul', 'Aziz', 'Gaurav', 'Anju'],
        'Age':[27, 24, 22, 32],
        'Address':['Delhi', 'Andhra', 'Tamilnadu', 'Kerala'],
        'Qualification':['MTech', 'BTech', 'MCA', 'Phd']}
 
# Convert the dictionary into a DataFrame
df = pd.DataFrame(data, index=['A', 'B', 'C', 'D'])

# Print the full DataFrame
print(df)
print('*'*40)
# Select two columns
print(df[['Name', 'Qualification']])

     Name  Age    Address Qualification
A   Abdul   27      Delhi         MTech
B    Aziz   24     Andhra         BTech
C  Gaurav   22  Tamilnadu           MCA
D    Anju   32     Kerala           Phd
****************************************
     Name Qualification
A   Abdul         MTech
B    Aziz         BTech
C  Gaurav           MCA
D    Anju           Phd
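
A detail worth noting is that single brackets with one name return a ``Series``, while a list of names returns a ``DataFrame``, even if the list has only one entry. A minimal sketch on a cut-down version of the employee data (the name `df_emp` is illustrative):

```python
import pandas as pd

df_emp = pd.DataFrame({'Name': ['Abdul', 'Aziz'],
                       'Age': [27, 24]},
                      index=['A', 'B'])

# Single name: the result is a one-dimensional Series
one_col = df_emp['Name']

# List of names: the result is a two-dimensional DataFrame
sub_frame = df_emp[['Name']]

print(type(one_col))
print(type(sub_frame))
```
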


In [21]:
# Retrieve a row by its label with the loc indexer

first = df.loc["A"]
print(first)

Name             Abdul
Age                 27
Address          Delhi
Qualification    MTech
Name: A, dtype: object
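
Rows can also be retrieved by position with `.iloc`, and `.loc` accepts a row label and a column name together. The sketch below shows both on a two-row stand-in for the employee table (the name `df_emp` is illustrative):

```python
import pandas as pd

df_emp = pd.DataFrame({'Name': ['Abdul', 'Aziz'],
                       'Age': [27, 24]},
                      index=['A', 'B'])

# Same row, reached by label and by position
row_by_label = df_emp.loc['B']
row_by_position = df_emp.iloc[1]

# A single cell: row label plus column name
age_of_b = df_emp.loc['B', 'Age']

print(row_by_label)
print(age_of_b)
```
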


## Read CSV Files
A simple way to store large data sets is to use CSV (comma-separated values) files.

In [22]:
df = pd.read_csv('data.csv')
print(df)

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4

[169 rows x 4 columns]
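
If `data.csv` is not at hand, the same call can be tried on an in-memory string. The sketch below is a stand-in, not the actual file: the three rows are a hypothetical sample with the same column names, and the last row deliberately leaves `Calories` empty to show how a blank field is read.

```python
import io
import pandas as pd

# Hypothetical sample in the same layout as data.csv
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
45,109,175,
"""

# read_csv accepts any file-like object, not only a path
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv)
print(df_csv.dtypes)
```

The empty field in the last row is read as ``NaN``.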


### Quick checking DataFrames
* `.head()`
* `.tail()`
* `.sample()`
* `.info()`
* `.describe()`
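
The first three of these methods can be tried on any small DataFrame. The sketch below uses a made-up ten-row table (`df_small`) and fixes `random_state` so that the sampled rows are reproducible.

```python
import pandas as pd

df_small = pd.DataFrame({'x': range(10),
                         'y': [v * v for v in range(10)]})

first_rows = df_small.head(3)    # first 3 rows
last_rows = df_small.tail(2)     # last 2 rows

# 4 rows drawn at random; random_state makes the draw repeatable
random_rows = df_small.sample(4, random_state=0)

print(first_rows)
print(last_rows)
print(random_rows)
```
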

In [23]:
df.head()  # returns the first 5 rows

   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
4        45    117       148     406.0


In [24]:
df.head(3)  # returns the first 3 rows

   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0


In [25]:
df.tail()  # returns the last 5 rows

     Duration  Pulse  Maxpulse  Calories
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4


In [26]:
df.tail(7)  # returns the last 7 rows

     Duration  Pulse  Maxpulse  Calories
162        45     95       130     270.0
163        45    100       140     280.9
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4


In [28]:
df.sample(10)  # returns 10 random rows

     Duration  Pulse  Maxpulse  Calories
70        150     97       129    1115.0
20         60    108       131     364.2
80         30    159       182     319.2
66        150    105       135     873.4
2          60    103       135     340.0
59         45    123       152     321.0
137        45    115       137     318.0
88         45    129       103     242.0
13         60    104       132     379.3
61        160    110       137    1034.4


The ``df.info()`` method prints a summary of the ``DataFrame``: the index range, the number of columns, the column labels, the number of non-null values in each column, the column data types, and the memory usage.

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  169 non-null    int64  
 1   Pulse     169 non-null    int64  
 2   Maxpulse  169 non-null    int64  
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
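
The pieces of this summary are also available individually, which helps when they are needed as values rather than printed text. A minimal sketch on a made-up three-row frame (`df_demo`), using `count()` for the non-null counts and the `dtypes` attribute for the column types:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'Pulse': [110, 117, 103],
                        'Calories': [409.1, np.nan, 340.0]})

# Non-null values per column, as in the "Non-Null Count" field
non_null = df_demo.count()

# Data type of each column, as in the "Dtype" field
column_types = df_demo.dtypes

print(non_null)
print(column_types)
```
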


---
The ``df.describe()`` method returns a statistical summary of the data in the ``DataFrame``.

> If the `DataFrame` contains numerical data, the description contains the following information for each column:

* count - The number of non-empty values.
* mean - The average (mean) value.
* std - The standard deviation.
* min - The minimum value.
* 25% - The 25th percentile.
* 50% - The 50th percentile (the median).
* 75% - The 75th percentile.
* max - The maximum value.
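
Each of these entries can be reproduced with its own method, which is a handy way to check what `describe()` reports. The sketch below does so for a small made-up ``Series`` that contains one missing value:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, float('nan')])

summary = s.describe()
print(summary)

# The same numbers, computed one at a time (NaN is skipped in each)
print(s.count())          # number of non-empty values
print(s.mean())           # average
print(s.std())            # sample standard deviation
print(s.quantile(0.25))   # 25th percentile
print(s.median())         # 50th percentile
```

The missing entry is left out of every statistic, so `count` is 4 rather than 5.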

In [30]:
df.describe()

        Duration       Pulse    Maxpulse    Calories
count      169.0       169.0       169.0       164.0
mean   63.846154  107.461538  134.047337  375.790244
std    42.299949   14.510259   16.450434  266.379919
min         15.0        80.0       100.0        50.3
25%         45.0       100.0       124.0     250.925
50%         60.0       105.0       131.0       318.6
75%         60.0       111.0       141.0       387.6
max        300.0       159.0       184.0      1860.4


In [31]:
df.describe().transpose()

          count        mean         std    min      25%    50%    75%     max
Duration  169.0   63.846154   42.299949   15.0     45.0   60.0   60.0   300.0
Pulse     169.0  107.461538   14.510259   80.0    100.0  105.0  111.0   159.0
Maxpulse  169.0  134.047337   16.450434  100.0    124.0  131.0  141.0   184.0
Calories  164.0  375.790244  266.379919   50.3  250.925  318.6  387.6  1860.4


> ``This is a short notebook to get familiar with Pandas. In the next notebook, we will work on a real-time data set to do data analysis.``

## References
* https://github.com/donnemartin/data-science-ipython-notebooks/tree/master/pandas
* https://github.com/LearnDataSci/articles/tree/master/Python%20Pandas%20Tutorial%20A%20Complete%20Introduction%20for%20Beginners