# Pandas
Pandas is an open-source Python library for working with relational or labeled data easily and intuitively. It provides data structures and operations for manipulating numerical tables and time series.

## Install and Import

In [None]:
pip install pandas

In [1]:
import pandas as pd

To check the installed version of pandas:

In [2]:
pd.__version__

'1.2.4'

##  Data Structures 
Pandas provides two main data structures for manipulating data:

- Series
- DataFrame

### Series
A pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index. A Series can be thought of as a single column in an Excel sheet.

In [3]:
data = [0.25,0.50,0.75,1]

In [4]:
data = pd.Series(data)

In [5]:
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [6]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])
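As a small sketch of how the element type of a Series is chosen, the snippet below builds two Series and inspects their dtype. The variable names `floats` and `mixed` are illustrative and do not come from the notebook.

```python
import pandas as pd

# A list of numbers that includes floats is stored as a float64 Series.
floats = pd.Series([0.25, 0.50, 0.75, 1])

# Mixed Python objects fall back to the generic "object" dtype.
mixed = pd.Series([1, "two", 3.0])

print(floats.dtype)
print(mixed.dtype)
```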

#### Index
A Series has two kinds of index:
- implicit 
- explicit

##### implicit
The implicit index is the default positional index, which starts at 0.

In [8]:
data.index

RangeIndex(start=0, stop=4, step=1)

In [9]:
data[1]

0.5

##### explicit
The explicit index is a set of labels that we assign ourselves when creating the Series.

In [10]:
data = pd.Series([0.25,0.50,0.75,1], index=['a','b','c','d'])

In [11]:
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [12]:
data['b']  # explicit index (label)

0.5

In [13]:
data[1]  # implicit index (position); newer pandas versions deprecate this in favor of data.iloc[1]

0.5

Slicing the index

In [14]:
# Explicit (label) slicing includes both the start and the end label
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [15]:
# Implicit (positional) slicing runs from start up to, but not including, end
data[0:2]

a    0.25
b    0.50
dtype: float64

##### loc and iloc
The indexing operator refers to the square brackets that follow an object, as in df[ ]. The .loc and .iloc indexers also use square brackets to make selections.

##### loc
The .loc indexer selects data by the explicit index (label), while .iloc selects by the implicit index (position). Unlike the plain indexing operator, both stay unambiguous when the labels are themselves integers, and both can select subsets of data.

In [16]:
data_2 = pd.Series([0.25,0.50,0.75,1], index=[2,5,3,7])

In [17]:
data_2

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

In [18]:
data_2[2]  # explicit: a single integer key is treated as a label here

0.25

In [19]:
data_2[2:3]  # implicit: an integer slice is treated as positions

3    0.75
dtype: float64

In [20]:
data_2.loc[3]  # loc uses the explicit index (label 3)

0.75

In [21]:
data_2.iloc[3]  # iloc uses the implicit index (position 3)

1.0

In [22]:
data_2.loc[2:3]  # explicit: labels 2 through 3, end included

2    0.25
5    0.50
3    0.75
dtype: float64

In [23]:
data_2.iloc[2:3]  # implicit: position 2 only, end excluded

3    0.75
dtype: float64
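The difference between the two indexers can be collected into one short sketch. It rebuilds the same integer-labeled Series under the illustrative name `s` and compares single lookups and slices side by side.

```python
import pandas as pd

# Same values and integer labels as data_2 above.
s = pd.Series([0.25, 0.50, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = s.loc[3]      # the element whose label is 3
by_position = s.iloc[3]  # the fourth element, whatever its label

label_slice = s.loc[2:3]      # from label 2 through label 3, end included
position_slice = s.iloc[2:3]  # position 2 only, end excluded

print(by_label, by_position)
print(list(label_slice.index))
print(list(position_slice.index))
```

Spelling out .loc or .iloc makes the intent explicit, which matters most when the labels are integers and could be mistaken for positions.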

### DataFrame
A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). It has three principal components: the data, the rows, and the columns.

Create a DataFrame from two dictionaries

In [24]:
dict_population = {'Makassar': 300,
                   'Jakarta': 550,
                   'Depok': 350,
                   'Dayak': 320,
                   'Sorong': 280}

In [25]:
dict_population

{'Makassar': 300, 'Jakarta': 550, 'Depok': 350, 'Dayak': 320, 'Sorong': 280}

In [26]:
population = pd.Series(dict_population)

In [27]:
population

Makassar    300
Jakarta     550
Depok       350
Dayak       320
Sorong      280
dtype: int64

In [40]:
population.loc['Depok']  # explicit index using loc

350

In [41]:
population.iloc[2]  # implicit index using iloc

350

In [28]:
dict_area = {'Makassar': 800,
             'Jakarta': 1100,
             'Depok': 500,
             'Dayak': 600,
             'Sorong': 650}

In [29]:
dict_area

{'Makassar': 800, 'Jakarta': 1100, 'Depok': 500, 'Dayak': 600, 'Sorong': 650}

In [30]:
area = pd.Series(dict_area)

In [31]:
district = pd.DataFrame({'pop':population, 'area':area})

In [32]:
district

| | pop | area |
|---|---|---|
| Makassar | 300 | 800 |
| Jakarta | 550 | 1100 |
| Depok | 350 | 500 |
| Dayak | 320 | 600 |
| Sorong | 280 | 650 |
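The three components of a DataFrame (data, rows, and columns) can each be inspected directly. The sketch below rebuilds a three-row version of the table under the same name and prints each component.

```python
import pandas as pd

# A shortened copy of the district table, built directly from lists.
district = pd.DataFrame(
    {"pop": [300, 550, 350], "area": [800, 1100, 500]},
    index=["Makassar", "Jakarta", "Depok"],
)

print(district.index.tolist())    # row labels
print(district.columns.tolist())  # column labels
print(district.to_numpy())        # the underlying data as a NumPy array
print(district.shape)             # (number of rows, number of columns)
```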


In [34]:
# Show value of Dayak in area column
district['area']['Dayak']

600
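The chained lookup above first selects a column and then a row. As a sketch of an alternative, .loc accepts a row label and a column label together, so the same cell can be read in one step. The table is rebuilt here so the snippet runs on its own.

```python
import pandas as pd

district = pd.DataFrame(
    {"pop": [300, 550, 350, 320, 280], "area": [800, 1100, 500, 600, 650]},
    index=["Makassar", "Jakarta", "Depok", "Dayak", "Sorong"],
)

chained = district["area"]["Dayak"]          # column first, then row label
single_step = district.loc["Dayak", "area"]  # row label and column label together

print(chained, single_step)
```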

In [35]:
city = pd.Series(['Jakarta','Makassar','Bali', 'Surabaya','Bandung'], index=['Azkia','Arnesta','Arini','Nabila','Vegatama'])

In [36]:
hobby = pd.Series(['Reading','Watching','Drawing', 'Reading','Reading'],index=['Azkia','Arnesta','Arini','Nabila','Vegatama'])

In [37]:
dream = pd.Series(['Data Scientist','IT','Entrepreneur', 'Data Scientist','Data Scientist'],index=['Azkia','Arnesta','Arini','Nabila','Vegatama'])

In [38]:
Team4 = pd.DataFrame({'Domicile': city, 'Hobby': hobby, 'Dream': dream})

In [39]:
Team4

| | Domicile | Hobby | Dream |
|---|---|---|---|
| Azkia | Jakarta | Reading | Data Scientist |
| Arnesta | Makassar | Watching | IT |
| Arini | Bali | Drawing | Entrepreneur |
| Nabila | Surabaya | Reading | Data Scientist |
| Vegatama | Bandung | Reading | Data Scientist |


## Read_csv
To load data from a CSV file, use the read_csv() function, which returns the data as a DataFrame.

Import the data in titanic.csv using pd.read_csv and store it in df (a DataFrame).

In [43]:
df = pd.read_csv('titanic.csv')
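If titanic.csv is not at hand, read_csv() can be tried on any file-like object. The sketch below feeds it a made-up three-row sample through io.StringIO; the column names mirror the Titanic file, but the name `sample` and the rows themselves are only illustrative.

```python
import io

import pandas as pd

# Hypothetical sample standing in for a CSV file on disk.
csv_text = """PassengerId,Survived,Age
1,0,22
2,1,38
3,1,
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # rows and columns parsed from the text
print(sample.dtypes)  # Age becomes float because one value is missing
```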

The head() method returns the top n rows (5 by default) of a DataFrame or Series.

In [45]:
df.head() 

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [46]:
df.head(10)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |


The tail() method returns the bottom n rows (5 by default) of a DataFrame or Series.

In [47]:
df.tail()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |


In [48]:
df.tail(10)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 881 | 882 | 0 | 3 | Markun, Mr. Johann | male | 33.0 | 0 | 0 | 349257 | 7.8958 | NaN | S |
| 882 | 883 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22.0 | 0 | 0 | 7552 | 10.5167 | NaN | S |
| 883 | 884 | 0 | 2 | Banfield, Mr. Frederick James | male | 28.0 | 0 | 0 | C.A./SOTON 34068 | 10.5 | NaN | S |
| 884 | 885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.05 | NaN | S |
| 885 | 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.125 | NaN | Q |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |
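How the n argument behaves is easier to see on a tiny table. The sketch below uses a hypothetical ten-row DataFrame named `numbers`; it also shows that a negative n keeps everything except the last rows.

```python
import pandas as pd

# Ten rows holding the values 0 to 9, with the default 0..9 index.
numbers = pd.DataFrame({"n": range(10)})

first_three = numbers.head(3)        # rows 0, 1, 2
last_three = numbers.tail(3)         # rows 7, 8, 9
all_but_last_two = numbers.head(-2)  # negative n drops rows from the end

print(first_three["n"].tolist())
print(last_three["n"].tolist())
print(len(all_but_last_two))
```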


The info() method prints a summary of the DataFrame: the index range, each column's non-null count and dtype, and the memory usage.

In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


The describe() method shows basic statistical details such as count, mean, standard deviation, and percentiles for the numeric columns of a DataFrame or for a numeric Series.

In [50]:
df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
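Two details of the summary are easy to miss: text columns are left out by default, and the count row ignores missing values (which is why Age shows 714 rather than 891). The sketch below reproduces both on a hypothetical four-row table named `frame`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {
        "age": [20.0, 30.0, np.nan, 40.0],  # one missing value
        "name": ["a", "b", "c", "d"],       # non-numeric column
    }
)

summary = frame.describe()

print(summary.columns.tolist())      # only the numeric column is summarized
print(summary.loc["count", "age"])   # missing values are not counted
print(summary.loc["mean", "age"])    # mean of the three present values
```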


The unique() method returns an array of the distinct values in a Series, in order of appearance.

In [51]:
df['Age'].unique()

array([22.  , 38.  , 26.  , 35.  ,   nan, 54.  ,  2.  , 27.  , 14.  ,
        4.  , 58.  , 20.  , 39.  , 55.  , 31.  , 34.  , 15.  , 28.  ,
        8.  , 19.  , 40.  , 66.  , 42.  , 21.  , 18.  ,  3.  ,  7.  ,
       49.  , 29.  , 65.  , 28.5 ,  5.  , 11.  , 45.  , 17.  , 32.  ,
       16.  , 25.  ,  0.83, 30.  , 33.  , 23.  , 24.  , 46.  , 59.  ,
       71.  , 37.  , 47.  , 14.5 , 70.5 , 32.5 , 12.  ,  9.  , 36.5 ,
       51.  , 55.5 , 40.5 , 44.  ,  1.  , 61.  , 56.  , 50.  , 36.  ,
       45.5 , 20.5 , 62.  , 41.  , 52.  , 63.  , 23.5 ,  0.92, 43.  ,
       60.  , 10.  , 64.  , 13.  , 48.  ,  0.75, 53.  , 57.  , 80.  ,
       70.  , 24.5 ,  6.  ,  0.67, 30.5 ,  0.42, 34.5 , 74.  ])

The nunique() method returns the number of distinct values. Note that NaN is not counted by default.

In [59]:
df['Age'].nunique()

88
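The array returned by unique() includes NaN while nunique() skips it, so the two can differ by one. A minimal sketch on a hypothetical five-element Series named `ages`:

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, 38.0, np.nan, 22.0, np.nan])

distinct = ages.unique()                    # distinct values, NaN kept once
n_distinct = ages.nunique()                 # NaN ignored
n_with_nan = ages.nunique(dropna=False)     # NaN counted as a value

print(distinct)
print(n_distinct, n_with_nan)
```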

The value_counts() method returns the number of occurrences of each distinct value in a column, sorted from most to least frequent. NaN is excluded by default.

In [52]:
df['Age'].value_counts()

24.00    30
22.00    27
18.00    26
28.00    25
19.00    25
         ..
55.50     1
74.00     1
0.92      1
70.50     1
12.00     1
Name: Age, Length: 88, dtype: int64
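Two optional arguments change what value_counts() reports. The sketch below, on a hypothetical six-element Series named `ages`, compares the default call with dropna=False and normalize=True.

```python
import numpy as np
import pandas as pd

ages = pd.Series([24.0, 22.0, 24.0, np.nan, 24.0, 22.0])

counts = ages.value_counts()                    # NaN excluded, most frequent first
with_missing = ages.value_counts(dropna=False)  # NaN gets its own row
shares = ages.value_counts(normalize=True)      # fractions of the non-missing values

print(counts)
print(with_missing)
print(shares)
```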

The min() method returns the smallest value in the object.

In [54]:
df['Age'].min()

0.42

The max() method returns the largest value in the object.

In [55]:
df['Age'].max()

80.0

The count() method counts the non-missing values in each column (or row) of a DataFrame, or in a Series.

In [56]:
df['Age'].count()

714

The mean() method returns the arithmetic mean of the values, skipping missing values by default.

In [61]:
df['Age'].mean()

29.69911764705882
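These aggregations all skip missing values, so the mean is the sum of the present values divided by count(), not by the length of the column. A small sketch on a hypothetical Series named `ages`:

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, 38.0, np.nan, 26.0])

present = ages.count()                  # number of non-missing values
average = ages.mean()                   # (22 + 38 + 26) / 3
strict_average = ages.mean(skipna=False)  # NaN, because one value is missing

print(present, ages.min(), ages.max())
print(average)
print(strict_average)
```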

The isna() and isnull() methods flag missing values in a DataFrame or Series. isnull() is an alias of isna(), so both return the same result.

In [58]:
df['Age'].isnull()

0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888     True
889    False
890    False
Name: Age, Length: 891, dtype: bool

In [60]:
df['Age'].isna()

0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888     True
889    False
890    False
Name: Age, Length: 891, dtype: bool
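The boolean result is most useful when combined with other operations. As a sketch, summing it counts the missing values per column, and using it as a filter keeps only the rows where a value is missing. The three-row table `frame` below is made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"Age": [22.0, np.nan, 26.0], "Cabin": [None, "C85", None]}
)

mask = frame["Age"].isna()               # True where Age is missing
missing_per_column = frame.isna().sum()  # True counts as 1 when summed
rows_missing_age = frame[mask]           # keep only rows with a missing Age

print(missing_per_column)
print(rows_missing_age)
```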