# Pandas Crash Course

written by Mehdi Paydayesh on 25/04/2019


## What is Pandas?
Pandas is one of the most popular and widely used Python libraries in data science. The name is commonly expanded as “Python Data Analysis Library”, and the library has been a game changer for data manipulation and analysis.
It is built on top of the NumPy package and lets Python read datasets from various formats, such as CSV files, TSV files, or SQL databases. Its central data structure, the ‘DataFrame’, lets you read, store, select, and manipulate tabular data arranged in rows of observations and columns of variables. It also lets you quickly compute statistics such as the mean value of a column.

## Why is it important?
Pandas offers flexible data structures, easy syntax, and fast operations. Combined with libraries such as matplotlib for data visualization and NumPy for numerical computing, it forms a toolkit that is used extensively for scientific computing in Python.

## What will you find in this document?
I have put together a tutorial, in the form of a notebook in my GitHub repository, that covers the basics of Pandas. It covers Pandas DataFrames and basic data manipulations, with code samples. The tutorial is intended for those who want a quick guide to the common functions of Pandas. It gives you the essentials you need to kickstart your data analysis fun in Python!

## Requirements
The major libraries used:
1. Pandas
2. NumPy (used only in Part 9)

## How to install 
The easiest way to get Pandas is to install it through the Anaconda distribution. Alternatively, you can install it with "pip install pandas" or "conda install pandas".
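
To confirm that the installation worked, a quick check like the one below can be run in a Python session. This is a minimal sketch; it only assumes that pandas and NumPy are importable, and the version numbers printed will depend on your environment.

```python
import numpy as np
import pandas as pd

# Importing without an error already shows the packages are installed;
# printing the versions helps when comparing against a tutorial or bug report.
print("pandas:", pd.__version__)
print("numpy:", np.__version__)
```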


## File structure

The notebook file is called **PandasCrashCourse.ipynb** and includes the following sections:

**Part 0: Importing libraries**

**Part 1: Grabbing a specific column**

**Part 2: Grabbing multiple columns of data**

**Part 3: Using conditional filtering to select certain rows and columns**

**Part 4: Grabbing unique values**

**Part 5: Grabbing all the column names of the data frame**

**Part 6: Reporting back the information about the data frame**

**Part 7: Reporting the statistics in the data**

**Part 8: Reporting the range of the index**

**Part 9: Creating a pandas DataFrame from NumPy-generated numbers**

In [1]:
import pandas as pd

In [2]:
df=pd.read_csv('salaries.csv')
print(df)

     Name  Salary  Age
0    John   50000   34
1   Sally  120000   45
2  Alyssa   80000   27
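
The file salaries.csv itself is not reproduced in this document. Assuming its contents match the printed output above, the following sketch rebuilds the same data from an in-memory string so the remaining cells can be followed without the file. The variable name `csv_text` is introduced here for illustration only.

```python
import io

import pandas as pd

# CSV text reconstructed from the printed DataFrame above;
# the real salaries.csv may differ in formatting.
csv_text = """Name,Salary,Age
John,50000,34
Sally,120000,45
Alyssa,80000,27
"""

# read_csv accepts any file-like object, so StringIO stands in for the file
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # data type inferred for each column
```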


In [3]:
# grabbing a specific column
a = df["Salary"]
print(a)

0     50000
1    120000
2     80000
Name: Salary, dtype: int64


In [5]:
# grabbing multiple columns of data
b = df[['Name', 'Salary']]
print(b)

     Name  Salary
0    John   50000
1   Sally  120000
2  Alyssa   80000


In [7]:
# min, max and mean operations
a= df["Salary"].min()
print(a)
b= df["Salary"].max()
print(b)
c= df["Salary"].mean()
print(c)

50000
120000
83333.33333333333
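
The three statistics above can also be requested in a single call. The sketch below uses `Series.agg` with a list of function names; the DataFrame is rebuilt inline from the values printed earlier so the block runs on its own, and the name `summary` is only illustrative.

```python
import pandas as pd

# Same data as the salaries example, rebuilt inline
df = pd.DataFrame({
    "Name": ["John", "Sally", "Alyssa"],
    "Salary": [50000, 120000, 80000],
    "Age": [34, 45, 27],
})

# agg with a list of names returns a Series indexed by those names
summary = df["Salary"].agg(["min", "max", "mean"])
print(summary)
```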


In [11]:
# Using conditional filtering to select certain rows and columns
a = df["Age"] > 30
print(a)
print(df[a])  # filtering with the boolean mask
print(df[df["Age"] > 30])  # doing the filtering in one step

0     True
1     True
2    False
Name: Age, dtype: bool
    Name  Salary  Age
0   John   50000   34
1  Sally  120000   45
    Name  Salary  Age
0   John   50000   34
1  Sally  120000   45
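
The cell above filters rows only. To pick rows and columns together, one common approach is `.loc` with a boolean mask and a list of column names. The sketch below also combines two conditions with `&`; the salary threshold of 60000 is an arbitrary value chosen for illustration, and the DataFrame is rebuilt inline so the block is self-contained.

```python
import pandas as pd

# Same data as the salaries example, rebuilt inline
df = pd.DataFrame({
    "Name": ["John", "Sally", "Alyssa"],
    "Salary": [50000, 120000, 80000],
    "Age": [34, 45, 27],
})

# Each condition needs its own parentheses because & binds tighter than >
mask = (df["Age"] > 30) & (df["Salary"] > 60000)

# .loc[rows, columns]: keep the rows where mask is True and two of the columns
result = df.loc[mask, ["Name", "Salary"]]
print(result)
```

Use `|` instead of `&` for "or", and `~mask` to invert a condition.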


In [13]:
# grabbing unique values
a = df["Age"].unique()
# grabbing the number of unique values
b = df["Age"].nunique()
print(a)
print(b)

[34 45 27]
3


In [14]:
# grabbing all the column names of the data frame
a = df.columns
print(a)

Index(['Name', 'Salary', 'Age'], dtype='object')


In [15]:
# reporting back the information about the data frame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
Name      3 non-null object
Salary    3 non-null int64
Age       3 non-null int64
dtypes: int64(2), object(1)
memory usage: 152.0+ bytes


In [16]:
# reporting the statistics in the data
df.describe()

             Salary        Age
count      3.000000   3.000000
mean   83333.333333  35.333333
std    35118.845843   9.073772
min    50000.000000  27.000000
25%    65000.000000  30.500000
50%    80000.000000  34.000000
75%   100000.000000  39.500000
max   120000.000000  45.000000


In [17]:
# reporting the range of index
df.index

RangeIndex(start=0, stop=3, step=1)

## Creating a pandas DataFrame from NumPy-generated numbers

In [19]:
import numpy as np

In [24]:
mat=np.arange(0,10).reshape(5,2)
print (mat)

# converting to pandas data frame
df1=pd.DataFrame(data=mat)
print(df1)
# choosing column names for the data frame
df2=pd.DataFrame(data=mat, columns=["A","B"])
print(df2)

[[0 1]
 [2 3]
 [4 5]
 [6 7]
 [8 9]]
   0  1
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
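
Besides column names, `pd.DataFrame` also accepts an `index` argument for row labels. The sketch below is a variation on the cell above that uses random integers instead of `np.arange`. The seed, the value range, and the row labels `r1` to `r5` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator so repeated runs produce the same numbers
rng = np.random.default_rng(seed=0)

# 5x2 matrix of random integers in the range [0, 100)
mat = rng.integers(0, 100, size=(5, 2))

# Label both the columns and the rows
df3 = pd.DataFrame(
    data=mat,
    columns=["A", "B"],
    index=["r1", "r2", "r3", "r4", "r5"],
)
print(df3)

# Row labels can then be used with .loc to look up a single value
print(df3.loc["r3", "B"])
```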
