# Pandas
Pandas is an open-source Python library built on top of NumPy. It provides data structures and operations for manipulating labelled, tabular data and time series. It is popular mainly because it makes importing and analyzing data much easier. Pandas is fast for most in-memory workloads and productive to work with.

In [3]:
import pandas as pd
import numpy as np

In [3]:
np.arange(0,20).reshape(5,4)
# makes a 2-D array having 5 rows and 4 columns

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

In [4]:
## Create Dataframe
df = pd.DataFrame(
    data=np.arange(0, 20).reshape(5, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

In [5]:
df

Unnamed: 0,Column1,Column2,Column3,Column4
Row1,0,1,2,3
Row2,4,5,6,7
Row3,8,9,10,11
Row4,12,13,14,15
Row5,16,17,18,19


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, Row1 to Row5
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Column1  5 non-null      int32
 1   Column2  5 non-null      int32
 2   Column3  5 non-null      int32
 3   Column4  5 non-null      int32
dtypes: int32(4)
memory usage: 120.0+ bytes


In [7]:
type(df)

pandas.core.frame.DataFrame

In [8]:
df.head()

Unnamed: 0,Column1,Column2,Column3,Column4
Row1,0,1,2,3
Row2,4,5,6,7
Row3,8,9,10,11
Row4,12,13,14,15
Row5,16,17,18,19


In [9]:
df.describe()

Unnamed: 0,Column1,Column2,Column3,Column4
count,5.0,5.0,5.0,5.0
mean,8.0,9.0,10.0,11.0
std,6.324555,6.324555,6.324555,6.324555
min,0.0,1.0,2.0,3.0
25%,4.0,5.0,6.0,7.0
50%,8.0,9.0,10.0,11.0
75%,12.0,13.0,14.0,15.0
max,16.0,17.0,18.0,19.0


In [10]:
# Creating a Series from a list
# Tip: in Jupyter, Shift+Tab shows the documentation for the call under the cursor
days=['monday','tuesday','wednesday','thursday','friday','saturday','sunday']

my_series=pd.Series(days)
print(my_series)

0       monday
1      tuesday
2    wednesday
3     thursday
4       friday
5     saturday
6       sunday
dtype: object


In [11]:
type(my_series)

pandas.core.series.Series

In [12]:
type(df)

pandas.core.frame.DataFrame

In [13]:
df

Unnamed: 0,Column1,Column2,Column3,Column4
Row1,0,1,2,3
Row2,4,5,6,7
Row3,8,9,10,11
Row4,12,13,14,15
Row5,16,17,18,19


### Indexing
There are three main ways to index a DataFrame:
- by column name, e.g. df["Column1"]
- by row label with .loc, e.g. df.loc["Row1"]
- by integer row and column position with .iloc, e.g. df.iloc[0, 0]

In [16]:
x=df["Column1"]
x

Row1     0
Row2     4
Row3     8
Row4    12
Row5    16
Name: Column1, dtype: int32

In [17]:
type(x)

pandas.core.series.Series

In [18]:
y=df[["Column1","Column2"]]
y

Unnamed: 0,Column1,Column2
Row1,0,1
Row2,4,5
Row3,8,9
Row4,12,13
Row5,16,17


In [19]:
type(y)

pandas.core.frame.DataFrame

The two results show the difference between a Series and a DataFrame:
- selecting a single column (or a single row) with one label returns a one-dimensional Series
- selecting with a list of labels returns a two-dimensional DataFrame, even when the list holds only one label
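
What decides the result type is how the data is selected, not how many columns end up in it. The following is a minimal sketch of that point; `frame`, `one_col_series`, and `one_col_frame` are illustrative names, and the frame is rebuilt here only so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Rebuild the 5x4 frame from the earlier cells so the sketch is self-contained.
frame = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

one_col_series = frame["Column1"]    # a single label returns a 1-D Series
one_col_frame = frame[["Column1"]]   # a list of labels returns a 2-D DataFrame

print(type(one_col_series).__name__, one_col_series.shape)  # Series (5,)
print(type(one_col_frame).__name__, one_col_frame.shape)    # DataFrame (5, 1)
```

Both objects hold the same five values; only the dimensionality differs.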

In [20]:
## Select rows by index label with .loc
df.loc[['Row3','Row4']]

Unnamed: 0,Column1,Column2,Column3,Column4
Row3,8,9,10,11
Row4,12,13,14,15


In [21]:
df.iloc[2:4,0:2]
# .iloc[row_positions, column_positions]; the stop of each slice is excluded

Unnamed: 0,Column1,Column2
Row3,8,9
Row4,12,13
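
One detail worth seeing side by side: a `.loc` label slice includes both endpoints, while an `.iloc` position slice excludes the stop. The sketch below selects the same block both ways; `frame`, `by_label`, and `by_position` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Same 5x4 frame as in the earlier cells, rebuilt so the sketch runs on its own.
frame = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

# Label-based slice: "Row4" and "Column2" are both included.
by_label = frame.loc["Row3":"Row4", "Column1":"Column2"]

# Position-based slice: positions 4 (rows) and 2 (columns) are excluded.
by_position = frame.iloc[2:4, 0:2]

print(by_label.equals(by_position))  # True
```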


In [22]:
## Convert the DataFrame (all rows, every column after the first) into a NumPy array
df.iloc[:,1:].values

array([[ 1,  2,  3],
       [ 5,  6,  7],
       [ 9, 10, 11],
       [13, 14, 15],
       [17, 18, 19]])

In [23]:
df['Column2'].unique()
# returns each distinct value once, in order of first appearance

array([ 1,  5,  9, 13, 17])
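
Every value in `Column2` is already distinct, so `unique()` has nothing to collapse there. A small sketch on a column that does contain repeats may make the behaviour clearer; the `events` Series is illustrative data, and `nunique()` and `value_counts()` are the related pandas calls for counting.

```python
import pandas as pd

# Illustrative data with repeated values.
events = pd.Series(["rain", "thunderstorm", "rain", "sunny", "fog", "fog", "rain"])

distinct = events.unique()      # each distinct value once, in order of first appearance
n_distinct = events.nunique()   # how many distinct values there are
counts = events.value_counts()  # how often each value occurs, most frequent first

print(list(distinct))   # ['rain', 'thunderstorm', 'sunny', 'fog']
print(n_distinct)       # 4
print(counts["rain"])   # 3
```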

In [24]:
#creating series using dictionary
# named day_map rather than dict so the built-in dict type is not shadowed
day_map={'1':'monday','2':'tuesday','3':'wednesday','4':'friday'}

ser=pd.Series(day_map)
print(ser)

1       monday
2      tuesday
3    wednesday
4       friday
dtype: object


In [25]:
#creating a dataframe using dictionary
days=['monday','tuesday','wednesday','thursday','friday','saturday','sunday']
max_temp=[18,19,20,21,16,21,20]
min_temp=[10,13,12,8,7,11,12]
windspeed=[10,14,8,14,18,11,14]
event=['rain','thunderstorm','rain','sunny','fog','fog','rain']

weather={'Day':days,'Max Temperature':max_temp,'Min Temperature':min_temp,'Windspeed':windspeed,'Event':event}
df=pd.DataFrame(weather,index=range(1,8))
df

Unnamed: 0,Day,Max Temperature,Min Temperature,Windspeed,Event
1,monday,18,10,10,rain
2,tuesday,19,13,14,thunderstorm
3,wednesday,20,12,8,rain
4,thursday,21,8,14,sunny
5,friday,16,7,18,fog
6,saturday,21,11,11,fog
7,sunday,20,12,14,rain


In [26]:
# Write the DataFrame to a CSV file (the index is saved as an unnamed first column)
df.to_csv("weather.csv")

In [27]:
# Load the CSV back; the saved index returns as an ordinary "Unnamed: 0" column
df1=pd.read_csv("weather.csv")
df1

Unnamed: 0.1,Unnamed: 0,Day,Max Temperature,Min Temperature,Windspeed,Event
0,1,monday,18,10,10,rain
1,2,tuesday,19,13,14,thunderstorm
2,3,wednesday,20,12,8,rain
3,4,thursday,21,8,14,sunny
4,5,friday,16,7,18,fog
5,6,saturday,21,11,11,fog
6,7,sunday,20,12,14,rain
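
The `Unnamed: 0` column in `df1` is the original index: `to_csv` writes it as a first column with an empty header, and `read_csv` then treats it as regular data. One way to avoid that, sketched below, is to pass `index_col=0` when reading. The sketch uses a shortened copy of the weather data and an in-memory `io.StringIO` buffer in place of `weather.csv`, both chosen only to keep it self-contained.

```python
import io

import pandas as pd

# Shortened, illustrative copy of the weather data.
weather = pd.DataFrame(
    {"Day": ["monday", "tuesday", "wednesday"], "Max Temperature": [18, 19, 20]},
    index=range(1, 4),
)

# An in-memory buffer stands in for weather.csv so no file is left behind.
buffer = io.StringIO()
weather.to_csv(buffer)   # the index is written as an unnamed first column
buffer.seek(0)

# Treat that first column as the index again instead of as data.
restored = pd.read_csv(buffer, index_col=0)

print(restored.columns.tolist())  # ['Day', 'Max Temperature']
print(restored.index.tolist())    # [1, 2, 3]
```

If the index carries no information, writing with `to_csv(..., index=False)` is the other common option.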


### Let's revise pandas

#### 1. What is Pandas?
Pandas is an open-source Python library built on top of NumPy. It is made for working with relational or labelled data. It provides data structures for manipulating, cleaning, and analyzing data, and it handles missing values well. Pandas is fast for most in-memory workloads and productive to work with.

#### 2. What are the Different Types of Data Structures in Pandas?
The two data structures that are supported by Pandas are Series and DataFrames.
- Pandas Series: a one-dimensional labelled array that can hold data of any type. It is mostly used to represent a single column or row of data.
- Pandas DataFrame: a two-dimensional labelled data structure whose columns can hold different types. It stores data in tabular form. Its three main components are the data, the row index, and the column labels.
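
The two structures described above can be sketched as follows. The values are illustrative and the names `temps` and `table` are not from the notebook.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels.
temps = pd.Series(
    [18, 19, 20],
    index=["monday", "tuesday", "wednesday"],
    name="Max Temperature",
)

# A DataFrame: a table whose columns may have different dtypes.
table = pd.DataFrame(
    {"Max Temperature": [18, 19, 20], "Event": ["rain", "thunderstorm", "rain"]},
    index=["monday", "tuesday", "wednesday"],
)

print(temps["tuesday"])        # 19
print(table.shape)             # (3, 2)
print(table.index.tolist())    # row labels
print(table.columns.tolist())  # column labels
print(table.dtypes)            # one dtype per column
```

Each column of `table` is itself a Series, which is why `table["Event"]` returns one.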

#### 3. List Key Features of Pandas.
Pandas is used for efficient data analysis. Its key features are as follows:
- Fast and efficient data manipulation and analysis
- Provides time-series functionality
- Easy missing data handling
- Fast data merging and joining
- Flexible reshaping and pivoting of data sets
- Powerful group by functionality
- Loading data from different file formats (for example CSV, Excel, and JSON)
- Integrates with NumPy
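
A few of the features listed above (missing-data handling, group by, and merging) can be sketched in a handful of lines. The data is illustrative, including the deliberately missing temperature and the hypothetical `advice` lookup table.

```python
import numpy as np
import pandas as pd

# Illustrative weather data with one missing temperature.
weather = pd.DataFrame(
    {
        "Day": ["monday", "tuesday", "wednesday", "thursday"],
        "Max Temperature": [18.0, np.nan, 20.0, 21.0],
        "Event": ["rain", "thunderstorm", "rain", "sunny"],
    }
)

# Missing data: replace the NaN with the mean of the column.
filled = weather.fillna({"Max Temperature": weather["Max Temperature"].mean()})

# Group by: average maximum temperature per event type.
mean_by_event = filled.groupby("Event")["Max Temperature"].mean()

# Merge: attach a hypothetical lookup table; events without a match get NaN.
advice = pd.DataFrame({"Event": ["rain", "sunny"], "Advice": ["umbrella", "sunscreen"]})
merged = filled.merge(advice, on="Event", how="left")

print(mean_by_event["rain"])  # 19.0
print(merged)
```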