# Pandas
***

Pandas is an open-source library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Its two core data structures, Series and DataFrame, are designed for working with table-like data. Pandas provides tools for data manipulation:

- reshaping
- merging
- sorting
- slicing
- aggregation
- imputation

If you are using the Anaconda distribution, you do not have to install pandas because it is already included.
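
As a rough illustration of the operations listed above, here is a minimal sketch on made-up data. The `people` and `teams` tables and all of their values are hypothetical, and each step shows only one of several ways pandas can perform that operation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from any real dataset
people = pd.DataFrame({
    'name': ['Ana', 'Ben', 'Cleo', 'Dan'],
    'team': ['red', 'blue', 'red', 'blue'],
    'score': [3.0, np.nan, 5.0, 1.0],
})
teams = pd.DataFrame({'team': ['red', 'blue'], 'city': ['Oslo', 'Rome']})

# Imputation: replace the missing score with the mean of the known scores
people['score'] = people['score'].fillna(people['score'].mean())

# Sorting: order the rows by score, highest first
ranked = people.sort_values('score', ascending=False)

# Slicing: keep the first two rows by position
top_two = ranked.iloc[:2]

# Merging: add each team's city by joining on the shared 'team' column
merged = people.merge(teams, on='team')

# Aggregation: mean score per team
team_means = merged.groupby('team')['score'].mean()

# Reshaping: one row per team, one column per name
wide = people.pivot(index='team', columns='name', values='score')

print(team_means)
print(wide)
```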

## Installing Pandas

- For Windows (the same command works on macOS and Linux):
  - pip install pandas

Pandas data structures are built on Series and DataFrames. A Series is a single labeled column, and a DataFrame is a two-dimensional table made up of a collection of Series. To create a pandas Series, we can pass a one-dimensional NumPy array or a Python list. Let us look at some examples of Series.
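
A minimal sketch of that relationship, using made-up values: a Series can be built from either a NumPy array or a list, and each column pulled out of a DataFrame is itself a Series. The variable names here are illustrative only.

```python
import numpy as np
import pandas as pd

# A Series built from a one-dimensional NumPy array
arr = np.array([10, 20, 30])
s_from_array = pd.Series(arr)

# A Series built from a plain Python list
s_from_list = pd.Series([10, 20, 30])

# A DataFrame assembled from two Series that share the same default index
df = pd.DataFrame({'a': s_from_array, 'b': s_from_list * 2})

print(type(df['a']))  # each column of the DataFrame is a Series
print(df.ndim)        # number of dimensions of the DataFrame
```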

## Importing Pandas

In [2]:
import pandas as pd  # importing pandas as pd
import numpy as np   # importing numpy as np

## Creating Pandas Series with Default Index

In [3]:
nums = [1, 2, 3, 4, 5]
s = pd.Series(nums)
print(s)

0    1
1    2
2    3
3    4
4    5
dtype: int64


## Creating Pandas Series with Custom Index

In [4]:
nums = [1, 2, 3, 4, 5]
s = pd.Series(nums, index=[1, 2, 3, 4, 5])
print(s)

1    1
2    2
3    3
4    4
5    5
dtype: int64


In [5]:
fruits = ['Orange','Banana','Mango']
fruits = pd.Series(fruits, index=[1, 2, 3])
print(fruits)

1    Orange
2    Banana
3     Mango
dtype: object


## Creating Pandas Series from a Dictionary


In [6]:
dct = {'name':'Ruben','country':'Spain','city':'Leon'}

In [7]:
s = pd.Series(dct)
print(s)

name       Ruben
country    Spain
city        Leon
dtype: object
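
When a Series is created from a dictionary, the keys become the index labels. The sketch below reuses the same dictionary to show label-based and position-based lookups; it is one possible way to read values back, not the only one.

```python
import pandas as pd

dct = {'name': 'Ruben', 'country': 'Spain', 'city': 'Leon'}
s = pd.Series(dct)

print(list(s.index))  # the dictionary keys, in insertion order
print(s['country'])   # look up a value by its label
print(s.iloc[0])      # look up a value by its position
```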


## Creating a Constant Pandas Series


In [8]:
s = pd.Series(10, index = [1, 2, 3])
print(s)

1    10
2    10
3    10
dtype: int64


## Creating a Pandas Series Using Linspace

In [9]:
s = pd.Series(np.linspace(5, 20, 10)) # linspace(start, stop, num): 10 evenly spaced values from 5 to 20, both included
print(s)

0     5.000000
1     6.666667
2     8.333333
3    10.000000
4    11.666667
5    13.333333
6    15.000000
7    16.666667
8    18.333333
9    20.000000
dtype: float64
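
The spacing in the output above follows from the fact that `np.linspace` includes both endpoints by default, so `num` points are separated by `(stop - start) / (num - 1)`. A small sketch that checks this for the same arguments (the names `start`, `stop`, `num`, and `step` are chosen here for illustration):

```python
import numpy as np
import pandas as pd

start, stop, num = 5, 20, 10
s = pd.Series(np.linspace(start, stop, num))

# Spacing between neighbouring values when both endpoints are included
step = (stop - start) / (num - 1)

print(len(s))                  # number of points requested
print(s.iloc[0], s.iloc[-1])   # first and last values
print(np.allclose(s.diff().dropna(), step))  # every gap matches the step
```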


# DataFrames
***
Pandas DataFrames can be created in several ways.

## Creating DataFrames from a List of Lists

In [10]:
data = [
    ['Ruben', 'Spain', 'Leon'], 
    ['David', 'UK', 'London'],
    ['John', 'Sweden', 'Stockholm']
]
df = pd.DataFrame(data, columns=['Names','Country','City'])
print(df)

   Names Country       City
0  Ruben   Spain       Leon
1  David      UK     London
2   John  Sweden  Stockholm


## Creating a DataFrame Using a Dictionary

In [12]:
data = {
    'Name': ['Asabeneh', 'David', 'John'],
    'Country': ['Finland', 'UK', 'Sweden'],
    'City': ['Helsinki', 'London', 'Stockholm']
}
df = pd.DataFrame(data)
print(df)

       Name  Country       City
0  Asabeneh  Finland   Helsinki
1     David       UK     London
2      John   Sweden  Stockholm


## Creating DataFrames from a List of Dictionaries


In [11]:
data = [
    {'Name': 'Asabeneh', 'Country': 'Finland', 'City': 'Helsinki'},
    {'Name': 'David', 'Country': 'UK', 'City': 'London'},
    {'Name': 'John', 'Country': 'Sweden', 'City': 'Stockholm'}]
df = pd.DataFrame(data)
print(df)

       Name  Country       City
0  Asabeneh  Finland   Helsinki
1     David       UK     London
2      John   Sweden  Stockholm
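
The three construction styles are interchangeable when they describe the same rows and columns. As a sketch, the following builds the same table from a list of lists, a dictionary of columns, and a list of dictionaries, then compares the results with `DataFrame.equals`. The variable names are illustrative.

```python
import pandas as pd

columns = ['Name', 'Country', 'City']
rows = [
    ['Asabeneh', 'Finland', 'Helsinki'],
    ['David', 'UK', 'London'],
    ['John', 'Sweden', 'Stockholm'],
]

# 1. List of lists plus explicit column names
from_lists = pd.DataFrame(rows, columns=columns)

# 2. Dictionary mapping each column name to its values
from_dict = pd.DataFrame({
    'Name': ['Asabeneh', 'David', 'John'],
    'Country': ['Finland', 'UK', 'Sweden'],
    'City': ['Helsinki', 'London', 'Stockholm'],
})

# 3. List of dictionaries, one dictionary per row
from_records = pd.DataFrame([dict(zip(columns, row)) for row in rows])

print(from_lists.equals(from_dict))
print(from_dict.equals(from_records))
```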


# Reading a CSV File Using Pandas
***

In [13]:
import pandas as pd

df = pd.read_csv('../data/weight-height.csv')
print(df)

      Gender     Height      Weight
0       Male  73.847017  241.893563
1       Male  68.781904  162.310473
2       Male  74.110105  212.740856
3       Male  71.730978  220.042470
4       Male  69.881796  206.349801
...      ...        ...         ...
9995  Female  66.172652  136.777454
9996  Female  67.067155  170.867906
9997  Female  63.867992  128.475319
9998  Female  69.034243  163.852461
9999  Female  61.944246  113.649103

[10000 rows x 3 columns]
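
If the file is not at hand, `pd.read_csv` can also read from any file-like object. The sketch below feeds it a few rows copied from the output above through `io.StringIO`; it assumes the real file is comma-separated with a header row named `Gender,Height,Weight`, which the source does not show directly.

```python
import io
import pandas as pd

# A few rows copied from the printed output, used as a stand-in for the file.
# Assumption: the real file is comma-separated and starts with this header.
csv_text = """Gender,Height,Weight
Male,73.847017,241.893563
Male,68.781904,162.310473
Female,66.172652,136.777454
"""

sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
print(sample.dtypes)  # the numeric columns are parsed as floats
```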


## Data Exploration

In [14]:
print(df.head()) # head() returns the first five rows by default; pass a number to head() to get more or fewer rows

  Gender     Height      Weight
0   Male  73.847017  241.893563
1   Male  68.781904  162.310473
2   Male  74.110105  212.740856
3   Male  71.730978  220.042470
4   Male  69.881796  206.349801


Let us also explore the last rows of the DataFrame using the tail() method.

In [15]:
print(df.tail()) # tail() returns the last five rows by default; pass a number to tail() to get more or fewer rows

      Gender     Height      Weight
9995  Female  66.172652  136.777454
9996  Female  67.067155  170.867906
9997  Female  63.867992  128.475319
9998  Female  69.034243  163.852461
9999  Female  61.944246  113.649103
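
Both methods accept the number of rows as an argument. A short sketch on a hypothetical ten-row DataFrame (a stand-in, since the CSV file may not be available):

```python
import pandas as pd

# Hypothetical stand-in data: a single column holding 0..9
df = pd.DataFrame({'n': range(10)})

first_three = df.head(3)  # first three rows
last_two = df.tail(2)     # last two rows
default_head = df.head()  # no argument: five rows

print(first_three)
print(last_two)
```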


As you can see, the CSV file has three columns: Gender, Height and Weight. If a DataFrame had many columns, it would be hard to see them all in the printed output, so we should use an attribute to list the column names. We also may not know the number of rows in advance. Let's use the shape attribute.

In [16]:
print(df.shape) # (rows, columns): 10000 rows and 3 columns

(10000, 3)


Let us get all the column names using the columns attribute.

In [18]:
print(df.columns)

Index(['Gender', 'Height', 'Weight'], dtype='object')
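
Note that `shape` and `columns` are attributes rather than methods, so they are written without parentheses. A small sketch on a hypothetical three-row stand-in for the weight-height data (the rounded values are illustrative):

```python
import pandas as pd

# Hypothetical three-row stand-in for the weight-height data
df = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female'],
    'Height': [73.8, 68.8, 66.2],
    'Weight': [241.9, 162.3, 136.8],
})

n_rows, n_cols = df.shape        # shape is a (rows, columns) tuple
column_names = list(df.columns)  # convert the Index to a plain list

print(n_rows, n_cols)
print(column_names)
print(len(df) == n_rows)         # len() of a DataFrame counts its rows
```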


Now, let us get a specific column using its column name as the key.

In [20]:
heights = df['Height'] # selecting a single column returns a Series
print(heights)

0       73.847017
1       68.781904
2       74.110105
3       71.730978
4       69.881796
          ...    
9995    66.172652
9996    67.067155
9997    63.867992
9998    69.034243
9999    61.944246
Name: Height, Length: 10000, dtype: float64


In [22]:
weights = df['Weight'] # selecting a single column returns a Series
print(weights)

0       241.893563
1       162.310473
2       212.740856
3       220.042470
4       206.349801
           ...    
9995    136.777454
9996    170.867906
9997    128.475319
9998    163.852461
9999    113.649103
Name: Weight, Length: 10000, dtype: float64


In [23]:
print(len(heights) == len(weights))

True
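
Selecting with a single column name returns a Series, while selecting with a list of names returns a DataFrame. A minimal sketch on hypothetical stand-in data:

```python
import pandas as pd

# Hypothetical two-row stand-in for the weight-height data
df = pd.DataFrame({
    'Gender': ['Male', 'Female'],
    'Height': [73.8, 66.2],
    'Weight': [241.9, 136.8],
})

heights = df['Height']             # single key: a Series
subset = df[['Height', 'Weight']]  # list of keys: a DataFrame

print(type(heights).__name__)
print(type(subset).__name__)
```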


The describe() method provides descriptive statistics for a dataset.

In [24]:
print(heights.describe()) # gives statistical information about the height data

count    10000.000000
mean        66.367560
std          3.847528
min         54.263133
25%         63.505620
50%         66.318070
75%         69.174262
max         78.998742
Name: Height, dtype: float64


In [25]:
print(weights.describe())

count    10000.000000
mean       161.440357
std         32.108439
min         64.700127
25%        135.818051
50%        161.212928
75%        187.169525
max        269.989699
Name: Weight, dtype: float64


In [26]:
print(df.describe())  # describe() can also give statistical information for a whole DataFrame

             Height        Weight
count  10000.000000  10000.000000
mean      66.367560    161.440357
std        3.847528     32.108439
min       54.263133     64.700127
25%       63.505620    135.818051
50%       66.318070    161.212928
75%       69.174262    187.169525
max       78.998742    269.989699
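
To see where these numbers come from, the sketch below runs `describe()` on a tiny made-up Series and compares a few entries with the corresponding individual methods. The data and the name `toy` are hypothetical. The `std` row is the sample standard deviation, which is what `Series.std()` computes by default.

```python
import pandas as pd

# Hypothetical toy data chosen so the statistics are easy to check by hand
values = pd.Series([1, 2, 3, 4, 5], name='toy')
summary = values.describe()

print(summary)
print(summary['mean'], values.mean())          # arithmetic mean
print(summary['std'], values.std())            # sample standard deviation
print(summary['50%'], values.median())         # the 50% row is the median
print(summary['25%'], values.quantile(0.25))   # first quartile
```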


Similar to describe(), the info() method also gives information about the dataset, such as the column names, the number of non-null values in each column, and the data types.
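
A minimal sketch of `info()` on a hypothetical three-row stand-in for the weight-height data. By default `info()` prints its report directly; here the `buf` parameter redirects it into a string so it can be inspected or saved.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the weight-height data
df = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female'],
    'Height': [73.8, 68.8, 66.2],
    'Weight': [241.9, 162.3, 136.8],
})

# Capture the report in a text buffer instead of printing it directly
buffer = io.StringIO()
df.info(buf=buffer)
report = buffer.getvalue()

print(report)
```

Calling `df.info()` with no arguments prints the same report to standard output.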