**Pandas Tutorial**

---

Pandas is an open-source Python package widely used for data science, data analysis, and machine learning tasks. It is built on top of NumPy, a package that provides support for multi-dimensional arrays. As one of the most popular data wrangling packages, Pandas works well with many other data science modules in the Python ecosystem.


**Pandas is used to analyze data.**

It has functions for analyzing, cleaning, exploring, and manipulating data. Pandas allows us to analyze large data sets and draw conclusions based on statistical theory.

Pandas can clean messy data sets and make them readable and relevant.

Relevant data is very important in data science.

---

**Pandas DataFrames**

In [1]:
# importing pandas under pd alias
import pandas as pd

In [2]:
# Creating a sample dictionary
__student_markings = {
    'Name' : ['Avinash', 'Abhijeet', 'Elina', 'Rohit', 'Kamya', 'Laraib', 'Viran'],
    'Percentile' : [85, 92, 90, 82, 50, 85, 100]
}

# Creating a data frame from the dictionary '__student_markings'
data__framing = pd.DataFrame(__student_markings)
print(data__framing)

       Name  Percentile
0   Avinash          85
1  Abhijeet          92
2     Elina          90
3     Rohit          82
4     Kamya          50
5    Laraib          85
6     Viran         100


In the data frame above, the index (0 to 6) is added automatically. Pandas uses the loc attribute to return one or more specified rows by label.

In [3]:
print(data__framing.loc[3])

Name          Rohit
Percentile       82
Name: 3, dtype: object


Returning multiple rows: to return multiple rows, pass a list of index labels to loc.

In [4]:
print(data__framing.loc[[0,2]])

      Name  Percentile
0  Avinash          85
2    Elina          90
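
Beyond single labels and lists of labels, loc also accepts label slices, a row and column pair, and boolean masks. The sketch below rebuilds the same sample data under the illustrative names `student_marks` and `df` (these names are not part of the tutorial's own cells) and shows each form.

```python
import pandas as pd

# Same sample data as the tutorial's dictionary, under an illustrative name
student_marks = {
    'Name': ['Avinash', 'Abhijeet', 'Elina', 'Rohit', 'Kamya', 'Laraib', 'Viran'],
    'Percentile': [85, 92, 90, 82, 50, 85, 100],
}
df = pd.DataFrame(student_marks)

# Label slice: with loc, both endpoints are included (rows 1, 2 and 3)
rows_1_to_3 = df.loc[1:3]

# A row label plus a column label returns a single value
rohit_score = df.loc[3, 'Percentile']

# Boolean mask: names of the rows where Percentile is at least 90
top_students = df.loc[df['Percentile'] >= 90, 'Name']

print(rows_1_to_3)
print(rohit_score)
print(top_students.tolist())
```

Note that a loc slice includes its end label, unlike an ordinary Python list slice.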


Getting head and tail values:
The head() method returns the column headers and a specified number of rows, starting from the top. The tail() method returns the column headers and a specified number of rows, starting from the bottom.

In [5]:
print(data__framing.head(3))

       Name  Percentile
0   Avinash          85
1  Abhijeet          92
2     Elina          90


In [6]:
print(data__framing.tail(3))

     Name  Percentile
4   Kamya          50
5  Laraib          85
6   Viran         100
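
When no argument is given, head() and tail() return five rows, and a negative argument excludes rows from the opposite end. A minimal sketch using the same sample data, with `df` as an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Avinash', 'Abhijeet', 'Elina', 'Rohit', 'Kamya', 'Laraib', 'Viran'],
    'Percentile': [85, 92, 90, 82, 50, 85, 100],
})

first_rows = df.head()          # no argument: the first 5 rows
last_rows = df.tail()           # no argument: the last 5 rows
all_but_last_two = df.head(-2)  # negative n: every row except the last 2

print(first_rows)
print(last_rows)
print(all_but_last_two)
```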


The DataFrame object has a method called info() that gives you more information about the data set, such as the index range, column names, non-null counts, and dtypes.

In [7]:
# info() prints the summary itself and returns None, so no print() is needed
data__framing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        7 non-null      object
 1   Percentile  7 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 240.0+ bytes
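
Because info() writes its summary directly to standard output and returns None, wrapping it in print() adds a trailing None line. If the summary is needed as a string, one option is the `buf` parameter, as in this sketch (`df`, `buffer`, and `summary` are illustrative names):

```python
import io

import pandas as pd

df = pd.DataFrame({
    'Name': ['Avinash', 'Abhijeet', 'Elina', 'Rohit', 'Kamya', 'Laraib', 'Viran'],
    'Percentile': [85, 92, 90, 82, 50, 85, 100],
})

buffer = io.StringIO()
result = df.info(buf=buffer)  # the summary goes into the buffer, not stdout
summary = buffer.getvalue()

print(result is None)  # info() itself returns nothing
print(summary)
```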


---

**Pandas Series**

A pandas Series is like a column in a table. It is a one-dimensional array.

In [8]:
# Creating a one-dimensional list
__one_d_array = ['Avinash', 'B.Tech', 'Information Technology', 'Narula Institute of Technology', 'Kolkata']

# Creating a series of that array.
__a_series = pd.Series(__one_d_array)

print(__a_series)

0                           Avinash
1                            B.Tech
2            Information Technology
3    Narula Institute of Technology
4                           Kolkata
dtype: object


These indexes (0, 1, 2, ...) are called labels. They help us retrieve a specific value.

In [9]:
print(__a_series[0])

Avinash


We can create our own labels by passing a list to the index argument.

In [10]:
__a_series = pd.Series(__one_d_array, index = ['a', 'b', 'c', 'd', 'e'])
print(__a_series)

a                           Avinash
b                            B.Tech
c            Information Technology
d    Narula Institute of Technology
e                           Kolkata
dtype: object


In [11]:
print(__a_series['e'])

Kolkata
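
Once a Series has custom labels, it helps to be explicit about whether a lookup is by label or by position. A short sketch, assuming the same values and labels as above under the illustrative names `details` and `series`:

```python
import pandas as pd

details = ['Avinash', 'B.Tech', 'Information Technology',
           'Narula Institute of Technology', 'Kolkata']
series = pd.Series(details, index=['a', 'b', 'c', 'd', 'e'])

by_label = series.loc['e']        # label-based lookup
by_position = series.iloc[4]      # position-based lookup (positions start at 0)
subset = series.loc[['a', 'c']]   # a list of labels returns a smaller Series

print(by_label)
print(by_position)
print(subset)
```

Using loc and iloc makes the intent clear even when the labels happen to be integers.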


So far we have created a Series from a list, but we can also pass a **Dictionary as a Series**, where each key acts as a label.

In [12]:
__a_dictionary = {
    'Avinash' : 85,
    'Abhijeet' : 92,
    'Rohit' : 82,
    'Elina' : 90
}
__a__dictionary_series = pd.Series(__a_dictionary)
print(__a__dictionary_series)

Avinash     85
Abhijeet    92
Rohit       82
Elina       90
dtype: int64


In [13]:
print(__a__dictionary_series['Avinash'])

85
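
When building a Series from a dictionary, the index argument can also be used to pick only some of the keys. The sketch below uses the same marks under the illustrative name `marks`. A label that is not a key in the dictionary (here 'Kamya') comes back as a missing value.

```python
import pandas as pd

marks = {'Avinash': 85, 'Abhijeet': 92, 'Rohit': 82, 'Elina': 90}

# Only the listed keys are kept, in the order given
selected = pd.Series(marks, index=['Avinash', 'Elina'])

# 'Kamya' is not a key in marks, so its value is NaN
with_missing = pd.Series(marks, index=['Avinash', 'Kamya'])

print(selected)
print(with_missing)
```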


---

**CSV Files**

CSV (Comma Separated Values) is a simple file format used to store tabular data, such as a spreadsheet or database table. A CSV file stores tabular data (numbers and text) in plain text, and each line of the file is a data record. Python has a built-in module called csv for working with CSV files.
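
For comparison with the Pandas approach, here is a minimal sketch of reading CSV text with the built-in csv module. The text in `raw_text` is a made-up two-row sample held in memory, so no file on disk is needed.

```python
import csv
import io

# A small in-memory CSV sample (illustrative data, not a real file)
raw_text = "Name,Percentile\nAvinash,85\nAbhijeet,92\n"

# DictReader uses the first line as the field names
reader = csv.DictReader(io.StringIO(raw_text))
records = list(reader)

for record in records:
    # The csv module returns every value as a string
    print(record['Name'], record['Percentile'])
```

Unlike Pandas, the csv module does no type conversion, so the percentiles stay strings until converted explicitly.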

For better understanding, we continue with the data frame created earlier. The data__framing object was created at the beginning; now we convert that data frame into a CSV file.

In [14]:
# Converting the data frame to a CSV file
data__framing.to_csv('Marks.csv')

In [15]:
# Writing the CSV without the index column
data__framing.to_csv('Marks_removed_index.csv', index = False)
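
To see what the index argument changes without opening the files, to_csv() can be called with no path, in which case it returns the CSV text as a string. A small sketch with a two-row stand-in frame (`df` is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Avinash', 'Abhijeet'], 'Percentile': [85, 92]})

with_index = df.to_csv()                # index written as an unnamed first column
without_index = df.to_csv(index=False)  # only the data columns

print(with_index)
print(without_index)
```

The header of the first string starts with an empty field, which is where the index values are stored.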

Reading an existing CSV file into a data frame:

In [16]:
a_new_data_frame = pd.read_csv('Marks.csv')
print(a_new_data_frame)

   Unnamed: 0      Name  Percentile
0           0   Avinash          85
1           1  Abhijeet          92
2           2     Elina          90
3           3     Rohit          82
4           4     Kamya          50
5           5    Laraib          85
6           6     Viran         100


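The to_string() method renders the entire data frame as text. For a small frame like this one, the result is the same as printing it directly.

In [17]: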
print(a_new_data_frame.to_string())

   Unnamed: 0      Name  Percentile
0           0   Avinash          85
1           1  Abhijeet          92
2           2     Elina          90
3           3     Rohit          82
4           4     Kamya          50
5           5    Laraib          85
6           6     Viran         100
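
The "Unnamed: 0" column in the output above is the index that to_csv() wrote into Marks.csv, read back as ordinary data. One way to avoid it is to pass index_col=0 to read_csv(). The sketch below uses an in-memory stand-in for the file contents (`csv_text`, first two rows only) so that it runs on its own.

```python
import io

import pandas as pd

# Stand-in for the contents of 'Marks.csv' (first two rows only)
csv_text = ",Name,Percentile\n0,Avinash,85\n1,Abhijeet,92\n"

# Default read: the unnamed first column becomes 'Unnamed: 0'
default_read = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the row index instead
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_read.columns.tolist())
print(indexed_read.columns.tolist())
```

Reading 'Marks_removed_index.csv' instead would also avoid the extra column, since that file was written with index = False.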


---

**Cleaning Data**