<h1 align="center" style="color:red;">TELESOFT</h1>
<h1 align="center" style="color:green;">PANDAS BASICS</h1>
<h3 align="center" style="color:green;">TELESOFTAI : Zephania Reuben</h3>

#### What is Pandas?
  - **Pandas** is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.
  
  - The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering.
  - Data frames are tabular, meaning that they are based on rows and columns like you would see in a spreadsheet.
  - pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.
  
#### Here are just a few of the things that pandas does well:
 - Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
 - Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
 - Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data in computations
 - Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
 - Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
 - Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
 - Intuitive merging and joining data sets
 - Flexible reshaping and pivoting of data sets
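
The alignment and missing-data points above can be illustrated with a minimal sketch. The two Series and their labels are made up for illustration; they are not part of the Titanic data used later.

```python
import pandas as pd

# Two Series with partly overlapping labels (made-up values).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Addition aligns on labels, not positions. The label "x" has no
# partner in b, so the result for "x" is NaN (missing).
total = a + b
print(total)

# Missing entries can be counted, and sum() skips NaN by default.
print(total.isna().sum())
print(total.sum())
```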


#### Pandas Installation
- Installing with conda
   - conda install pandas
- Installing from PyPI
   - python -m pip install pandas
- **Installing pandas on Linux**
 - The following table lists, for some common Linux distributions, the command that installs the pandas package with the distribution's package manager:
 
**Distribution**        |       **Install Command**
----------------------  |  -------------------------------
Debian or Ubuntu (and other Debian derivatives)                 |  <code>sudo apt-get install python3-pandas</code>
Fedora                                                          |  <code>sudo dnf install python3-pandas</code>
Red Hat                                                         |  <code>sudo yum install python3-pandas</code>
CentOS/RHEL                                                     |  <code>sudo dnf install python3-pandas</code>


 #### 1.Understanding a pandas DataFrame
  - A pandas DataFrame (in a Jupyter Notebook) appears to be nothing more than an ordinary table of data consisting of rows and columns. Hiding beneath the surface are three components (the index, the columns, and the data, also known as values) that you must be aware of in order to use the DataFrame to its full potential.
  - Analyze the labeled anatomy of the DataFrame:
  - **Note**
   - In this notebook we will be using the **Titanic** dataset, which describes the passengers aboard the Titanic.

![Labeled anatomy of a DataFrame](Data/DataFrameDescription.png)

**The variables that describe the passengers are:**

- **PassengerId**: an ID given to each traveller on the boat.
- **Survived**: whether the passenger survived (1) or not (0).
- **Pclass**: the passenger class. It has three possible values: 1, 2, 3.
- **Name**: the passenger's name.
- **Sex**: the passenger's sex (male or female).
- **Age**: the passenger's age in years.
- **SibSp**: number of siblings and spouses traveling with the passenger.
- **Parch**: number of parents and children traveling with the passenger.
- **Ticket**: the ticket number.
- **Fare**: the amount paid for the ticket.
- **Cabin**: the passenger's cabin number.
- **Embarked**: the port of embarkation. It has three possible values: S, C, Q.

- A DataFrame has two axes: a **vertical axis** (the index) and a **horizontal axis** (the columns). Pandas borrows a convention from NumPy and uses the integers 0 and 1 as another way of referring to the vertical and horizontal axes, respectively.
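
As a minimal sketch of the 0/1 axis convention, the small DataFrame below (made-up values, not the Titanic data) is summed along each axis:

```python
import pandas as pd

# A tiny frame with made-up values.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 works down the vertical axis: one result per column.
col_totals = df.sum(axis=0)

# axis=1 works across the horizontal axis: one result per row.
row_totals = df.sum(axis=1)

print(col_totals)
print(row_totals)
```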

In [1]:
#Load library
import pandas as pd

#Create url

url = 'Data/Titanic.csv'

# Load data as a DataFrame
dataframe = pd.read_csv(url)

# Show first 5 rows
dataframe.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


#### Things to notice in this DataFrame

- First, in a data frame each row corresponds to one observation (e.g., a passenger) and each column corresponds to one feature (gender, age, etc.). For example, by looking at the third observation (index 2) we can see that **Heikkinen, Miss. Laina** traveled in third class, was 26 years old, was female, and survived the disaster.
- Second, each column has a name (e.g., Name, Pclass, Age) and each row has an index number (e.g., 0 for Mr. Owen Harris Braund). We will use these to select and manipulate observations and features.

#### 2.Creating a DataFrame
 - First method :
   - Create a dataframe and add columns independently.

In [2]:
#Load library
import pandas as pd

# Create a DataFrame
df = pd.DataFrame()

#Add columns to a DataFrame

df['Name'] = ['John','Rebecca','Lisa','Godfrey','Vivan']

df['Age'] = [19,16,27,18,91]

df['Country'] = ['Kenya','Uganda','Rwanda','Tanzania','Burundi']

#show DataFrame

df


Unnamed: 0,Name,Age,Country
0,John,19,Kenya
1,Rebecca,16,Uganda
2,Lisa,27,Rwanda
3,Godfrey,18,Tanzania
4,Vivan,91,Burundi


- Second method :
   - Create a dataframe and add columns at the same time.

In [3]:
#Load library
import pandas as pd

# Create a DataFrame
df = pd.DataFrame(columns=['Name','Age','Country'],
                  data=[
                       ['John',19,'Kenya'],
                       ['Rebecca',16,'Uganda'],
                       ['Lisa',19,'Rwanda'],
                       ['Godfrey',19,'Tanzania'],
                       ['Vivan',19,'Burundi']
                      ])

#show DataFrame
df

Unnamed: 0,Name,Age,Country
0,John,19,Kenya
1,Rebecca,16,Uganda
2,Lisa,19,Rwanda
3,Godfrey,19,Tanzania
4,Vivan,19,Burundi
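
A third common way to build a DataFrame, sketched here as an addition to the two methods above, is to pass a dictionary that maps column names to lists of values. The variable name `people` is only illustrative.

```python
import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values.
people = pd.DataFrame({
    "Name": ["John", "Rebecca", "Lisa"],
    "Age": [19, 16, 27],
    "Country": ["Kenya", "Uganda", "Rwanda"],
})

print(people)
```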


#### 3.Creating a Series

In [4]:
#Load library
import pandas as pd

#Create a Series
series = pd.Series(index=['Name','Age','Country'],data=['John',19,'Uganda'])

#show series
series

Name         John
Age            19
Country    Uganda
dtype: object

#### A series can be used to create a DataFrame as follows

In [5]:
#Load library
import pandas as pd

#Create a DataFrame
# DataFrame.append was removed in pandas 2.0, so the Series is passed
# to the constructor as a single row instead

df = pd.DataFrame([series])

#show DataFrame
df

Unnamed: 0,Name,Age,Country
0,John,19,Uganda


#### 4.Describing a DataFrame
- Describing a DataFrame involves looking at a short summary of its descriptive statistics.

In [6]:
#Load library
import pandas as pd

#Create url
url = 'Data/Titanic.csv'

#Load data as a DataFrame
dataframe = pd.read_csv(url)

# show statistics
dataframe.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


- We can also take a look at the number of rows and columns

In [7]:
dataframe.shape

(891, 12)

- The DataFrame has 891 rows (instances/samples) and 12 columns (features)

#### 5.Navigating DataFrames
   - You need to select individual data or slices of a DataFrame
    - **loc**
      - is useful when the index of the DataFrame is a label (e.g., a string).
    - **iloc**
      - works by looking for the position in the DataFrame. For example, iloc[0] will return the first row regardless of whether the index is an integer or a label.
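
The difference between the two selectors can be sketched on a small frame with a string index. The names and ages are made up. One detail worth noting is that a `loc` slice includes both endpoints, while an `iloc` slice excludes the stop position.

```python
import pandas as pd

# A small frame indexed by name (made-up values).
ages = pd.DataFrame(
    {"Age": [19, 16, 27, 18]},
    index=["John", "Rebecca", "Lisa", "Godfrey"],
)

# Label-based slice: both "Rebecca" and "Lisa" are included.
by_label = ages.loc["Rebecca":"Lisa"]

# Position-based slice: positions 1 and 2; the stop position 3 is excluded.
by_position = ages.iloc[1:3]

# Position 0 is the first row, whatever the index labels are.
first_row = ages.iloc[0]

print(by_label)
print(by_position)
print(first_row)
```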

In [8]:
# Select three rows
dataframe.iloc[1:4] # positions 1, 2, and 3; dataframe.iloc[:4] would also include position 0

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S


- DataFrames do not need to be numerically indexed. We can set the index of a DataFrame to any value where the value is unique to each row. For example, we can set the index to be passenger names and then select rows using a name:

In [9]:
#set index 

dataframe = dataframe.set_index(dataframe['Name'])

#use index to slice and show row
dataframe.loc['Heikkinen, Miss. Laina']

PassengerId                         3
Survived                            1
Pclass                              3
Name           Heikkinen, Miss. Laina
Sex                            female
Age                                26
SibSp                               0
Parch                               0
Ticket               STON/O2. 3101282
Fare                            7.925
Cabin                             NaN
Embarked                            S
Name: Heikkinen, Miss. Laina, dtype: object

#### 6.Selecting Rows Based on Conditionals
 - Suppose we want to select all women in Titanic

In [10]:
# Load library
import pandas as pd

# Create URL
url = 'Data/Titanic.csv'

# Load data
dataframe = pd.read_csv(url)

# Show top two rows where column 'sex' is 'female'
dataframe[dataframe['Sex'] == 'female'].head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


- Multiple conditions are easy as well. For example, here we select all the rows where the passenger is a female aged 27 or older:


In [11]:
# Show top two rows where column 'sex' is 'female' and 'age' >=27
dataframe[(dataframe['Sex'] == 'female') & (dataframe['Age'] >= 27) ].head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
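
Conditions can also be combined with `|` (or), negated with `~`, or expressed with `isin`. The sketch below uses a small made-up frame with Titanic-like column names rather than the real file; each condition needs its own parentheses because `&`, `|`, and `~` bind more tightly than comparisons.

```python
import pandas as pd

# Made-up rows with Titanic-like columns.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Embarked": ["S", "C", "S", "Q"],
})

# Either condition may hold: female OR aged 35 and over.
either = df[(df["Sex"] == "female") | (df["Age"] >= 35)]

# Negation: every row that is NOT female.
not_female = df[~(df["Sex"] == "female")]

# Membership test against a list of allowed values.
ports = df[df["Embarked"].isin(["C", "Q"])]

print(either)
print(not_female)
print(ports)
```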


#### 7.Replacing Values
 - pandas’ replace is an easy way to find and replace values. For example, we can replace any instance of "female" in the Sex column with "Woman":

In [12]:
# Load library
import pandas as pd

# Create URL
url = 'Data/Titanic.csv'

# Load data
dataframe = pd.read_csv(url)

# Replace values, show two rows
dataframe['Sex'].replace("female", "Woman").head(2)

0     male
1    Woman
Name: Sex, dtype: object

- We can also replace multiple values at the same time:

In [13]:
# Replace "female" and "male" with "Woman" and "Man"
dataframe['Sex'].replace(["female", "male"], ["Woman", "Man"]).head(5)

0      Man
1    Woman
2    Woman
3    Woman
4      Man
Name: Sex, dtype: object
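
The same replacement can be written with a dictionary that maps old values to new ones. The sketch below uses a short made-up Series and also shows that `replace` returns a new object and leaves the original untouched.

```python
import pandas as pd

# A short made-up column.
sex = pd.Series(["male", "female", "female"], name="Sex")

# Keys are the values to find; dictionary values are their replacements.
renamed = sex.replace({"female": "Woman", "male": "Man"})

print(renamed)
print(sex)  # the original Series is unchanged
```

To keep the result, assign it back, for example to the column it came from.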

#### 8.Renaming Columns
- Rename columns using the rename method:

In [14]:
# Load library
import pandas as pd

# Create URL
url = 'Data/Titanic.csv'

# Load data
dataframe = pd.read_csv(url)

# Rename column, show two rows
dataframe.rename(columns={'Pclass': 'Passenger Class'}).head(2)

Unnamed: 0,PassengerId,Survived,Passenger Class,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C


- Notice that the rename method can accept a dictionary as a parameter. We can use the dictionary to change multiple column names at once:

In [15]:
# Rename columns, show two rows
dataframe.rename(columns={'Pclass': 'Passenger Class', 'Sex': 'Gender'}).head(2)

Unnamed: 0,PassengerId,Survived,Passenger Class,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
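
Like most DataFrame methods, `rename` returns a new DataFrame and leaves the original unchanged. A minimal sketch on a made-up two-row frame:

```python
import pandas as pd

# Made-up rows with two Titanic-like columns.
df = pd.DataFrame({"Pclass": [3, 1], "Sex": ["male", "female"]})

# The result must be assigned to a name to be kept.
renamed = df.rename(columns={"Pclass": "Passenger Class"})

print(list(df.columns))       # original column names
print(list(renamed.columns))  # renamed column names
```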


#### 9.Finding the Minimum, Maximum, Sum, Average, and Count

In [16]:
# Load library
import pandas as pd
# Create URL
url = 'Data/Titanic.csv'
# Load data
dataframe = pd.read_csv(url)
# Calculate statistics
print('Maximum:', dataframe['Age'].max())
print('Minimum:', dataframe['Age'].min())
print('Mean:', dataframe['Age'].mean())
print('Sum:', dataframe['Age'].sum())
print('Count:', dataframe['Age'].count())

Maximum: 80.0
Minimum: 0.42
Mean: 29.69911764705882
Sum: 21205.17
Count: 714
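
Note that the count above (714) is smaller than the number of rows (891) because `count` skips missing values. The sketch below reproduces this on a made-up Series with one missing entry and computes all five statistics in a single call with `agg`.

```python
import pandas as pd

# Made-up ages; None becomes NaN (a missing value).
ages = pd.Series([22.0, 38.0, None, 35.0], name="Age")

# agg accepts a list of function names and returns one value per name.
summary = ages.agg(["min", "max", "mean", "sum", "count"])

print(summary)
print(len(ages))  # number of entries, missing ones included
```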


#### 10.Finding Unique Values
 - Use unique to view an array of all unique values in a column: 

In [17]:
# Load library
import pandas as pd

# Create URL
url = 'Data/Titanic.csv'

# Load data
dataframe = pd.read_csv(url)

# Select unique values
dataframe['Sex'].unique()

array(['male', 'female'], dtype=object)

- Alternatively, value_counts will display all unique values with the number of times each value appears:

In [18]:
dataframe['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64
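
Two related calls are sketched below on a made-up Series: `nunique` returns only the number of distinct values, and `value_counts(normalize=True)` returns proportions instead of counts.

```python
import pandas as pd

# A short made-up column.
sex = pd.Series(["male", "female", "male", "male"])

counts = sex.value_counts()                # occurrences per value
shares = sex.value_counts(normalize=True)  # fraction of rows per value
n_unique = sex.nunique()                   # number of distinct values

print(counts)
print(shares)
print(n_unique)
```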

#### 11.Handling Missing Values
 - isnull and notnull return booleans indicating whether a value is missing:

In [19]:
# Load library
import pandas as pd

# Create URL
url = 'Data/Titanic.csv'

# Load data
dataframe = pd.read_csv(url)

## Select missing values, show two rows
dataframe[dataframe['Age'].isnull()].head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
17,18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0,,S
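
Once missing values are located, they are usually counted, dropped, or filled. The sketch below shows each option on a made-up frame. Filling with the column mean is only one possible strategy, chosen here for illustration.

```python
import pandas as pd

# Made-up rows; None becomes NaN in the Age column.
df = pd.DataFrame({"Name": ["A", "B", "C"], "Age": [22.0, None, 35.0]})

# notnull is the opposite of isnull: keep rows where Age is present.
has_age = df[df["Age"].notnull()]

# Summing a boolean Series counts its True entries.
n_missing = df["Age"].isnull().sum()

# Drop rows whose Age is missing.
dropped = df.dropna(subset=["Age"])

# Fill missing ages with the mean of the observed ages.
filled = df["Age"].fillna(df["Age"].mean())

print(n_missing)
print(dropped)
print(filled)
```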


#### 12.Deleting a Column

 - The best way to delete a column is to use drop with the parameter axis=1 (i.e., the column axis):

In [20]:
# Load library
import pandas as pd

# Create URL
url = 'Data/Titanic.csv'

# Load data
dataframe = pd.read_csv(url)

# Delete column
dataframe.drop('Age', axis=1).head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,1,0,PC 17599,71.2833,C85,C


- You can also use a list of column names as the main argument to drop multiple columns at once:

In [21]:
# Drop columns
dataframe.drop(['Age', 'Sex'], axis=1).head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,0,PC 17599,71.2833,C85,C
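
An equivalent spelling, sketched below on a made-up one-row frame, passes the names through the `columns` keyword so that no `axis` argument is needed.

```python
import pandas as pd

# A made-up one-row frame.
df = pd.DataFrame({"Name": ["A"], "Sex": ["male"], "Age": [22.0]})

# Same effect as df.drop(['Age', 'Sex'], axis=1).
trimmed = df.drop(columns=["Age", "Sex"])

print(trimmed)
```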


#### 13.Deleting a Row
 - Use a boolean condition to create a new DataFrame excluding the rows you want to delete:

In [22]:
# Load library
import pandas as pd

# Create URL
url = 'Data/Titanic.csv'

# Load data
dataframe = pd.read_csv(url)

# Delete rows, show first two rows of output
dataframe[dataframe['Sex'] != 'male'].head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
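
Rows can also be removed by index label with `drop`. The sketch below, on a made-up frame, shows that approach next to the boolean filter, followed by `reset_index(drop=True)` to renumber the remaining rows.

```python
import pandas as pd

# Made-up rows.
df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "male"],
})

# Remove the row whose index label is 0.
without_first = df.drop(index=[0])

# Boolean filter, then renumber the rows that remain from 0.
kept = df[df["Sex"] != "male"].reset_index(drop=True)

print(without_first)
print(kept)
```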


#### 14.Dropping Duplicate Rows
- Use drop_duplicates, but be mindful of the parameters:

In [23]:
# Load library
import pandas as pd
# Create URL
url = 'Data/Titanic.csv'
# Load data
dataframe = pd.read_csv(url)
# Drop duplicates, show first two rows of output
dataframe.drop_duplicates(keep='last').head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
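
The Titanic file has no fully duplicated rows, so the output above is unchanged. To see what the parameters do, here is a minimal sketch on a made-up frame that does contain duplicates; `keep` selects which copy survives and `subset` restricts the comparison to the listed columns.

```python
import pandas as pd

# Rows 0 and 2 are identical; row 3 differs only in Age.
df = pd.DataFrame({
    "Name": ["A", "B", "A", "A"],
    "Sex": ["male", "female", "male", "male"],
    "Age": [22, 38, 22, 40],
})

# Default: compare all columns and keep the first copy.
first = df.drop_duplicates()

# Keep the last copy instead.
last = df.drop_duplicates(keep="last")

# Compare only the Sex column: one row per distinct value.
by_sex = df.drop_duplicates(subset=["Sex"])

print(first)
print(last)
print(by_sex)
```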


#### 15.Grouping Rows by Values
- groupby is one of the most powerful features in pandas:

In [24]:
# Load library
import pandas as pd

# Create URL
url = 'Data/Titanic.csv'

# Load data
dataframe = pd.read_csv(url)

# Group rows by the values of the column 'Sex', calculate mean
# of each group
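# numeric_only=True skips text columns such as Name, which pandas 2.0+
# would otherwise refuse to average
dataframe.groupby('Sex').mean(numeric_only=True)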

Sex,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
female,431.028662,0.742038,2.159236,27.915709,0.694268,0.649682,44.479818
male,454.147314,0.188908,2.389948,30.726645,0.429809,0.235702,25.523893
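
A group-by is often narrowed to one column or given several aggregations at once. The sketch below uses a made-up frame with Titanic-like columns; the output column names `mean_age` and `passengers` are chosen here for illustration.

```python
import pandas as pd

# Made-up rows with Titanic-like columns.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# Select one column after grouping: mean of Survived per group.
survival_rate = df.groupby("Sex")["Survived"].mean()

# Named aggregation: new_column=(source_column, function).
summary = df.groupby("Sex").agg(
    mean_age=("Age", "mean"),
    passengers=("Survived", "count"),
)

print(survival_rate)
print(summary)
```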


#### 16.Concatenating DataFrames
 - Use concat with axis=0 to concatenate along the row axis:

In [25]:
# Load library
import pandas as pd

# Create DataFrame
data_a = {'id': ['1', '2', '3'],
'first': ['Alex', 'Amy', 'Allen'],
'last': ['Anderson', 'Ackerman', 'Ali']}
dataframe_a = pd.DataFrame(data_a, columns = ['id', 'first', 'last'])

# Create DataFrame
data_b = {'id': ['4', '5', '6'],
'first': ['Billy', 'Brian', 'Bran'],
'last': ['Bonder', 'Black', 'Balwner']}
dataframe_b = pd.DataFrame(data_b, columns = ['id', 'first', 'last'])

# Concatenate DataFrames by rows
pd.concat([dataframe_a, dataframe_b], axis=0)

Unnamed: 0,id,first,last
0,1,Alex,Anderson
1,2,Amy,Ackerman
2,3,Allen,Ali
0,4,Billy,Bonder
1,5,Brian,Black
2,6,Bran,Balwner
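
Notice that the index labels 0, 1, 2 repeat in the result above, because each input keeps its own index. A minimal sketch of two variations, using shortened made-up frames: `ignore_index=True` renumbers the rows, and `axis=1` places the frames side by side instead.

```python
import pandas as pd

# Two small made-up frames with the same columns.
a = pd.DataFrame({"id": ["1", "2"], "first": ["Alex", "Amy"]})
b = pd.DataFrame({"id": ["4", "5"], "first": ["Billy", "Brian"]})

# Default row-wise concatenation keeps each frame's own index labels.
stacked = pd.concat([a, b], axis=0)

# ignore_index=True discards them and numbers the rows from 0.
renumbered = pd.concat([a, b], axis=0, ignore_index=True)

# axis=1 concatenates along the column axis.
side_by_side = pd.concat([a, b], axis=1)

print(stacked)
print(renumbered)
print(side_by_side)
```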


<h4 align="center">Write to: telesoftai@gmail.com</h4>