### Pandas
- Pandas is an open-source Python library that provides high-performance data structures for data manipulation and analysis.
- It provides a variety of data structures and operations for manipulating numerical data and time series.
- This library is based on the NumPy library.

Pandas Objects
- Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.
- There are three fundamental pandas data structures:<br>
    1. Series<br>
    2. DataFrame<br>
    3. Index
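
To make the label idea concrete, here is a minimal sketch that compares a plain NumPy array with a Series carrying string labels. The values and the labels `a`, `b`, `c` are made up for illustration.

```python
import numpy as np
import pandas as pd

# A plain NumPy array is addressed by integer position only.
arr = np.array([10, 20, 30])
first_by_position = arr[0]

# The same values in a Series with explicit (hypothetical) labels.
ser = pd.Series(arr, index=["a", "b", "c"])
value_by_label = ser["b"]        # label-based lookup
value_by_position = ser.iloc[1]  # position-based lookup still works

print(first_by_position, value_by_label, value_by_position)
```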

Series
- A pandas Series is a labelled one-dimensional array that can hold any type of data (integer, string, float, Python object, and so on).
- A pandas Series is comparable to a single column in an Excel spreadsheet.
- Using the pd.Series() constructor, we can easily convert a list, tuple, or dictionary into a Series.

Creating a series

In [3]:
import pandas as pd
import numpy as np

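# An empty Series
myser=pd.Series()
print(myser)

# A Series built from a NumPy array gets a default integer index (0, 1, 2, ...)
data=np.array(['g','e','e','k','s'])
print(data)
myser=pd.Series(data)
print(myser)
# Access the first element by its label, 0
myser[0]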

Series([], dtype: object)
['g' 'e' 'e' 'k' 's']
0    g
1    e
2    e
3    k
4    s
dtype: object


'g'

Creating a series from lists

In [5]:
mylist=[10,20,30,40,50]
print(type(mylist))
myseri=pd.Series(mylist)
print(myseri)
print(type(myseri))

<class 'list'>
0    10
1    20
2    30
3    40
4    50
dtype: int64
<class 'pandas.core.series.Series'>
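
Tuples and dictionaries can be converted in the same way as lists. The short sketch below uses made-up values; with a dictionary, the keys become the index labels.

```python
import pandas as pd

# From a tuple: behaves like a list and gets a default 0..n-1 index.
from_tuple = pd.Series((1.5, 2.5, 3.5))

# From a dictionary: the keys become the index labels.
from_dict = pd.Series({"x": 100, "y": 200, "z": 300})

print(from_tuple)
print(from_dict)
print(from_dict["y"])  # look up a value by its label
```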


Pandas Index
- A pandas Index holds the row or column labels of a Series or DataFrame and is an efficient tool for extracting particular rows and columns of data.
- Its job is to organise data and make it easily accessible.
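
As a rough illustration of what an Index does, the sketch below builds a small table with invented airline names (not taken from any real file), looks at its default index, and then promotes a column to be the row labels with `set_index()`.

```python
import pandas as pd

# Hypothetical sample data, loosely modelled on an airline table.
df = pd.DataFrame({
    "Name": ["Alpha Air", "Beta Jet", "Gamma Wings"],
    "Country": ["India", "Russia", "China"],
})

# By default a DataFrame gets a RangeIndex: 0, 1, 2, ...
default_index = df.index

# set_index() returns a new DataFrame that uses a column as row labels.
by_name = df.set_index("Name")

# Row labels now allow direct lookup with .loc.
country = by_name.loc["Beta Jet", "Country"]

print(default_index)
print(by_name.index)
print(country)
```

Because `set_index()` returns a new object, `df` itself keeps its original integer index.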

Creating Index

In [9]:
data=pd.read_csv("airlines.csv")
print(data)

                                              Name IATA ICAO         Callsign  \
0                                   Private flight    -  NaN              NaN   
1                                      135 Airways  NaN  GNL          GENERAL   
2                                    1Time Airline   1T  RNX          NEXTIME   
3     2 Sqn No 1 Elementary Flying Training School  NaN  WYT              NaN   
4                                  213 Flight Unit  NaN  TFU              NaN   
...                                            ...  ...  ...              ...   
6156                                   GX Airlines  NaN  CBG            SPRAY   
6157                        Lynx Aviation (L3/SSX)  NaN  SSX           Shasta   
6158                               Jetgo Australia   JG   \N              NaN   
6159                                  Air Carnival   2S   \N              NaN   
6160                                 Svyaz Rossiya   7R  SJM  RussianConnecty   

             Country Active

In [10]:
# head() returns the first 5 rows of the DataFrame by default
data.head()

                                           Name  IATA  ICAO  Callsign         Country  Active
0                                Private flight     -   NaN       NaN             NaN       Y
1                                   135 Airways   NaN   GNL   GENERAL   United States       N
2                                 1Time Airline    1T   RNX   NEXTIME    South Africa       Y
3  2 Sqn No 1 Elementary Flying Training School   NaN   WYT       NaN  United Kingdom       N
4                               213 Flight Unit   NaN   TFU       NaN          Russia       N


In [11]:
# tail() returns the last 5 rows of the DataFrame by default
data.tail()

                        Name  IATA  ICAO         Callsign        Country  Active
6156             GX Airlines   NaN   CBG            SPRAY          China       Y
6157  Lynx Aviation (L3/SSX)   NaN   SSX           Shasta  United States       N
6158         Jetgo Australia    JG    \N              NaN      Australia       Y
6159            Air Carnival    2S    \N              NaN          India       Y
6160           Svyaz Rossiya    7R   SJM  RussianConnecty         Russia       Y


In [12]:
data.shape

(6161, 6)

In [13]:
data.columns

Index(['Name', 'IATA', 'ICAO', 'Callsign', 'Country', 'Active'], dtype='object')

In [33]:
data['Country']

0                  NaN
1        United States
2         South Africa
3       United Kingdom
4               Russia
             ...      
6156             China
6157     United States
6158         Australia
6159             India
6160            Russia
Name: Country, Length: 6161, dtype: object

In [34]:
data.isnull().sum()

Name           0
IATA        4627
ICAO          86
Callsign     808
Country       15
Active         0
dtype: int64

In [35]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6161 entries, 0 to 6160
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Name      6161 non-null   object
 1   IATA      1534 non-null   object
 2   ICAO      6075 non-null   object
 3   Callsign  5353 non-null   object
 4   Country   6146 non-null   object
 5   Active    6161 non-null   object
dtypes: object(6)
memory usage: 288.9+ KB


In [36]:
data.describe()

                   Name  IATA  ICAO  Callsign        Country  Active
count              6161  1534  6075      5353           6146    6161
unique             6072  1121  5854      5262            277       3
top     Royal Air Force    1I    \N      Inc.  United States       N
freq                  5     7   188        20           1099    4906


Pandas DataFrame
- A DataFrame is pandas' two-dimensional data structure with labelled rows and columns.
- DataFrames are similar to spreadsheets in Excel or Calc, or to SQL tables.
- Pandas DataFrame consists of three main components: the data, the index and the columns.
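
The three components can be inspected directly. The following is a small sketch with made-up data; the row labels `r1` and `r2` are hypothetical and only there to make the index visible.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Carl", "Keni"], "Age": [20, 25]},
    index=["r1", "r2"],  # hypothetical row labels
)

print(df.to_numpy())  # the data, as a NumPy array
print(df.index)       # the row labels
print(df.columns)     # the column labels
```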

Creating a Pandas DataFrame

In [37]:
mylist2=['Venom','Scarlete','Thomas','James']
DF=pd.DataFrame(mylist2)
DF2=pd.DataFrame([10,20,30,40,50],columns=["Numbers"])
print(DF)
print(DF2)

          0
0     Venom
1  Scarlete
2    Thomas
3     James
   Numbers
0       10
1       20
2       30
3       40
4       50


Creating DataFrame from dictionary of ndarray / lists<br>
Note: To generate a DataFrame from a dict of ndarrays/lists, every ndarray or list must have the same length.

In [38]:
mydict={'Name':['Carl','Keni','Stark','George'],'Age':[20,25,19,18]}
DF=pd.DataFrame(mydict)
print(DF)

     Name  Age
0    Carl   20
1    Keni   25
2   Stark   19
3  George   18
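
The equal-length requirement in the note above can be checked with a quick experiment. In this sketch the lists deliberately differ in length, so the constructor raises a `ValueError`. Wrapping each list in a Series is one possible workaround (an illustration, not the only option), because pandas then aligns on the index and pads with NaN.

```python
import pandas as pd

# Lists of different lengths (3 names, 2 ages) cannot form a DataFrame.
bad_dict = {"Name": ["Carl", "Keni", "Stark"], "Age": [20, 25]}

try:
    pd.DataFrame(bad_dict)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

# Possible workaround: wrap each list in a Series so pandas aligns on
# the index and pads the shorter column with NaN.
padded = pd.DataFrame({key: pd.Series(val) for key, val in bad_dict.items()})
print(padded)
```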


Reindexing
- Reindexing changes the row labels and column labels of a DataFrame.
- It conforms the data to a specified set of labels along a given axis.
- To reindex a DataFrame, use the reindex() method, which returns a new DataFrame and leaves the original unchanged.
- Labels in the new index that have no matching record in the DataFrame are given the value NaN by default.

In [39]:
rein=pd.DataFrame({"P":[1,2,3,4,5],"Q":[5,6,7,8,9],"R":[12,13,14,15,16],"S":[17,18,19,20,16]},index=["A","B","C","D","E"])
rein

   P  Q   R   S
A  1  5  12  17
B  2  6  13  18
C  3  7  14  19
D  4  8  15  20
E  5  9  16  16


In [41]:
# None of the new labels exist in rein, so every row is populated with NaN values
rein.reindex(["1","2","3","4","5"])

    P   Q   R   S
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN


In [42]:
# We can fill in the missing values using the fill_value parameter
rein.reindex(["a","b","c","d","e"],fill_value=25)

    P   Q   R   S
a  25  25  25  25
b  25  25  25  25
c  25  25  25  25
d  25  25  25  25
e  25  25  25  25
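
In the examples above none of the new labels match the old ones, so every value is replaced. The sketch below, using a small made-up DataFrame, shows the more typical case where some labels overlap: matching rows keep their data, and only the new label is filled in. The column `R` is hypothetical and does not exist in the original frame.

```python
import pandas as pd

df = pd.DataFrame({"P": [1, 2, 3], "Q": [4, 5, 6]}, index=["A", "B", "C"])

# "C" and "A" exist and keep their values; "Z" is new and becomes NaN.
rows = df.reindex(["C", "A", "Z"])

# The same works for columns; "R" is a new column filled with 0.
cols = df.reindex(columns=["Q", "R"], fill_value=0)

print(rows)
print(cols)
print(df)  # the original DataFrame is unchanged
```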


Pandas Sort

Pandas supports two kinds of sorting:
- By label
- By actual value

1. By label: The sort_index() method sorts a DataFrame by its labels. The axis argument selects row labels (axis=0, the default) or column labels (axis=1), and the ascending argument sets the sort order. By default, row labels are sorted in ascending order.

Sort the index

In [29]:
unsortedDF=pd.DataFrame(np.random.randn(8,2),index=[1,4,6,2,3,5,8,7],columns=['column2','column1'])
print(unsortedDF)
sortedDF=unsortedDF.sort_index()
print(sortedDF)

    column2   column1
1 -0.717842  0.929385
4 -0.430439 -0.003901
6  1.787624 -1.232624
2  1.585708 -0.315589
3 -0.307780  0.011777
5  0.021609  2.241835
8 -0.588367 -0.222236
7  0.200973 -1.643809
    column2   column1
1 -0.717842  0.929385
2  1.585708 -0.315589
3 -0.307780  0.011777
4 -0.430439 -0.003901
5  0.021609  2.241835
6  1.787624 -1.232624
7  0.200973 -1.643809
8 -0.588367 -0.222236


Sort the columns

In [22]:
# axis=1 sorts by column labels instead of row labels
sortedDF=unsortedDF.sort_index(axis=1)
print(sortedDF)

    column1   column2
1 -0.998267  0.491271
4 -0.064724 -0.288120
6 -1.114242 -1.223622
2 -0.517045 -0.638181
3  0.591919  1.675234
5  1.479163 -0.511360
8  0.405411 -0.633551
7 -1.175506 -0.492848


Order of Sorting

In [21]:
sortedDF=unsortedDF.sort_index(ascending=False)
print(sortedDF)

    column2   column1
8 -0.633551  0.405411
7 -0.492848 -1.175506
6 -1.223622 -1.114242
5 -0.511360  1.479163
4 -0.288120 -0.064724
3  1.675234  0.591919
2 -0.638181 -0.517045
1  0.491271 -0.998267


2. By actual value: The sort_values() method sorts by the data itself rather than by the labels. Its 'by' argument names the column (or list of columns) whose values determine the order.

In [28]:
unsortedDF2=pd.DataFrame({'Column1':[2,1,1,1],'Column2':[1,3,2,4]})
sortedDF2=unsortedDF2.sort_values(by='Column2')
print(sortedDF2)

   Column1  Column2
0        2        1
2        1        2
1        1        3
3        1        4
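
`sort_values()` also accepts several columns and a separate sort direction for each. The following sketch reuses the same small DataFrame; the choice of tie-breaking order is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Column1": [2, 1, 1, 1], "Column2": [1, 3, 2, 4]})

# Sort by Column1 ascending, then break ties by Column2 descending.
multi = df.sort_values(by=["Column1", "Column2"], ascending=[True, False])

# sort_values() keeps the original row labels; reset_index(drop=True)
# renumbers them if a fresh 0..n-1 index is wanted.
renumbered = multi.reset_index(drop=True)

print(multi)
print(renumbered)
```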


Working with text data
- Pandas includes a set of string functions, available through the .str accessor, that make working with string data simple.
- These functions skip missing (NaN) values: a NaN entry stays NaN in the result.

In [36]:
ser=pd.Series(['Tommas','Jarvis','Sui','07',np.nan,'Parvez','Seri'])
print(ser.str.lower())

0    tommas
1    jarvis
2       sui
3        07
4       NaN
5    parvez
6      seri
dtype: object


In [31]:
print(ser.str.upper())

0    TOMMAS
1    JARVIS
2       SUI
3        07
4       NaN
5    PARVEZ
6      SERI
dtype: object
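
Other `.str` methods follow the same pattern. The sketch below applies `str.len()` and `str.contains()` to the same Series. Passing `na=False` is one way to treat missing entries as non-matches; the search text `"ar"` is an arbitrary example.

```python
import numpy as np
import pandas as pd

ser = pd.Series(["Tommas", "Jarvis", "Sui", "07", np.nan, "Parvez", "Seri"])

# Length of each string; the missing entry stays missing.
lengths = ser.str.len()

# Substring test; na=False treats the missing entry as "no match".
has_ar = ser.str.contains("ar", na=False)

print(lengths)
print(has_ar)
print(ser[has_ar])  # keep only the rows whose text contains "ar"
```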


In [63]:
dataset={'Subject1':[80,np.nan,86,np.nan,90],'Subject2':[98,55,94,75,np.nan],'Subject3':[np.nan,98,96,np.nan,69]}
DF=pd.DataFrame(dataset)
print("My DataFrame :- \n",DF)
print("\nMaximum Marks Subject Wise:-\n",DF.max())

My DataFrame :- 
    Subject1  Subject2  Subject3
0      80.0      98.0       NaN
1       NaN      55.0      98.0
2      86.0      94.0      96.0
3       NaN      75.0       NaN
4      90.0       NaN      69.0

Maximum Marks Subject Wise:-
 Subject1    90.0
Subject2    98.0
Subject3    98.0
dtype: float64


In [64]:
print("Minimum Marks Subject Wise:-\n",DF.min())

Minimum Marks Subject Wise:-
 Subject1    80.0
Subject2    55.0
Subject3    69.0
dtype: float64


In [65]:
print("Median:-\n",DF.median())

Median:-
 Subject1    86.0
Subject2    84.5
Subject3    96.0
dtype: float64


In [66]:
print("Count of non-empty values:-\n",DF.count())

Count of non-empty values:-
 Subject1    3
Subject2    4
Subject3    3
dtype: int64


Indexing and Selecting data
- In pandas, indexing means selecting specific rows and columns of data from a DataFrame.
- This can mean selecting all the rows and some of the columns, some of the rows and all the columns, or a subset of both rows and columns.
- Another term for indexing is subset selection.
- Pandas supports three main types of multi-axis indexing: the indexing operator [], .loc (label-based), and .iloc (integer position-based).


Indexing a DataFrame using the indexing operator []

In [4]:
# making a DataFrame from a CSV file, using the "Name" column as the index
data=pd.read_csv("nba.csv",index_col="Name")
# retrieving columns by indexing operator
first=data["Team"]
first

Name
Avery Bradley    Boston Celtics
Jae Crowder      Boston Celtics
John Holland     Boston Celtics
R.J. Hunter      Boston Celtics
Jonas Jerebko    Boston Celtics
                      ...      
Shelvin Mack          Utah Jazz
Raul Neto             Utah Jazz
Tibor Pleiss          Utah Jazz
Jeff Withey           Utah Jazz
NaN                         NaN
Name: Team, Length: 458, dtype: object

Indexing a DataFrame using .loc

In [5]:
# retrieving rows by label with .loc
first=data.loc["Tyler Zeller"]
second=data.loc["Avery Bradley"]
print(first,"\n\n\n",second)

Team        Boston Celtics
Number                44.0
Position                 C
Age                   26.0
Height                 7-0
Weight               253.0
College     North Carolina
Salary           2616975.0
Name: Tyler Zeller, dtype: object 


 Team        Boston Celtics
Number                 0.0
Position                PG
Age                   25.0
Height                 6-2
Weight               180.0
College              Texas
Salary           7730337.0
Name: Avery Bradley, dtype: object


Indexing a DataFrame using .iloc

In [6]:
# retrieving rows by integer position with .iloc
first=data.iloc[25]
second=data.iloc[10]
print(first,"\n\n\n",second)

Team        Brooklyn Nets
Number               33.0
Position               PF
Age                  26.0
Height               6-10
Weight              220.0
College       Saint Louis
Salary           947276.0
Name: Willie Reed, dtype: object 


 Team        Boston Celtics
Number                 7.0
Position                 C
Age                   24.0
Height                 6-9
Weight               260.0
College         Ohio State
Salary           2569260.0
Name: Jared Sullinger, dtype: object
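
The examples above retrieve a single column or whole rows. To select a subset of both rows and columns at once, `.loc` and `.iloc` accept a row selector and a column selector together. The sketch below uses a small, entirely hypothetical table (invented names and numbers) rather than the NBA file.

```python
import pandas as pd

# Hypothetical stand-in for a player table, indexed by name.
players = pd.DataFrame(
    {
        "Team": ["Red", "Red", "Blue", "Blue"],
        "Age": [25, 31, 22, 28],
        "Salary": [500, 900, 300, 700],
    },
    index=pd.Index(["Ann", "Bob", "Cid", "Dee"], name="Name"),
)

# [] with a list of names selects several columns, all rows.
two_cols = players[["Team", "Age"]]

# .loc selects by label; label slices include the end label.
by_label = players.loc["Bob":"Cid", ["Team", "Salary"]]

# .iloc selects by integer position; the end of a slice is excluded.
by_position = players.iloc[0:2, 1:3]

# A boolean mask inside .loc filters rows by a condition.
older = players.loc[players["Age"] > 26, "Salary"]

print(two_cols, by_label, by_position, older, sep="\n\n")
```

Note the difference in slicing: `.loc["Bob":"Cid"]` includes the row labelled `Cid`, while `.iloc[0:2]` stops before position 2.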
