<a href="https://colab.research.google.com/github/stanstevo/data_science/blob/main/intro_to_pandas_series_and_dataframes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Pandas Series and DataFrames

## Objectives

* Understand Pandas Series and DataFrames
* Create Series and DataFrames
* Perform basic operations with Series
* Explore DataFrame basics
* Select data from DataFrames
* Apply functions to Series and DataFrames

## Loading Libraries

In [1]:
# numpy - for arithmetic operations and high-level mathematical functions on arrays
import numpy as np
# pandas - for working with relational or labeled data
import pandas as pd

## What is a Pandas Series?

* A **one-dimensional** labeled array capable of holding data of any type, such as *integers*, *strings*, *floats*, *Python objects*, etc.
* A Pandas Series is like a single column in a table.


### Key features of a Pandas Series

* **Homogeneous Data**: A Series holds data of a single data type (integer, float, string, etc.). Mixed values are stored under the general-purpose `object` dtype.
* **Labeled Index**: Each element in a Series is associated with a label called an *index*. Unique labels are common practice, though not strictly required. The labels only need to be hashable, i.e. usable as keys in a dictionary. The index allows for easy and efficient data retrieval and manipulation.
* **Vectorized Operations**: Series support vectorized operations, i.e. you can apply an operation to the entire Series without writing an explicit loop.
* **Alignment of Data**: When performing operations between Series, Pandas automatically aligns the data on index labels, which simplifies data manipulation.
* **Creation**: A Series can be created from a list, a NumPy array, a dictionary, a DataFrame slice and other data sources.
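
The vectorized operations and label alignment described above can be sketched as follows. The Series `jan` and `feb` and their values are made up for illustration and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical example data (assumed for illustration only)
jan = pd.Series([100, 200, 300], index=["A", "B", "C"])
feb = pd.Series([10, 20, 30], index=["C", "A", "D"])

# Vectorized operation: every element is doubled without an explicit loop
doubled = jan * 2

# Alignment: values are matched by index label, not by position.
# Labels found in only one of the two Series ("B" and "D") give NaN.
total = jan + feb

print(doubled)
print(total)
```

Note that `total` becomes a float Series, because the missing matches are represented as `NaN`.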

In [None]:
# example of a series from a list
marks = [10, 20, 33, 42, 19, 30]

# series
marks_series = pd.Series(marks)
marks_series

0    10
1    20
2    33
3    42
4    19
5    30
dtype: int64

## Creating and Displaying

In [None]:
# example 1 - Creating a series from a list
data = [10.5, 11.2, 10.7, 9.9, 10.2]

# series
list_series = pd.Series(data, name="Student Marks")
list_series

0    10.5
1    11.2
2    10.7
3     9.9
4    10.2
Name: Student Marks, dtype: float64

In [None]:
# data type
type(list_series)

pandas.core.series.Series

In [None]:
# example 2 - Creating a series from a NumPy Array
data_arr = np.array(data) # created an array from a list

type(data_arr)

numpy.ndarray

In [None]:
# series from array
arr_series = pd.Series(data_arr, name="Array Series")
arr_series

0    10.5
1    11.2
2    10.7
3     9.9
4    10.2
Name: Array Series, dtype: float64

In [None]:
# example 3 - Creating a series from a dictionary
data_dict = {
    "Prof" : 100,
    "Dominic" : 250,
    "Carol" : 300,
    "Eve" : 450
}

type(data_dict)

dict

In [None]:
# series from dict
dict_series = pd.Series(data_dict, name="Sky Team")
dict_series

Prof       100
Dominic    250
Carol      300
Eve        450
Name: Sky Team, dtype: int64

In [None]:
# series with custom index labels
balance = [1000, 1500, 2000, 4000] # data to store in the series
custom_labels = ['A', 'B', 'C', 'D'] # custom indexes

custom_label_series = pd.Series(data = balance, index=custom_labels, name='Balances')
custom_label_series

A    1000
B    1500
C    2000
D    4000
Name: Balances, dtype: int64

## Basic Operations With Series

In [None]:
arr_series

0    10.5
1    11.2
2    10.7
3     9.9
4    10.2
Name: Array Series, dtype: float64

In [None]:
# accessing elements in a series
print(arr_series[3])

9.9


In [None]:
dict_series

Prof       100
Dominic    250
Carol      300
Eve        450
Name: Sky Team, dtype: int64

In [None]:
# accessing elements in a series
print(dict_series['Carol'])

300


In [None]:
custom_label_series

A    1000
B    1500
C    2000
D    4000
Name: Balances, dtype: int64

In [None]:
# accessing elements in a series with a label slice (both endpoints are included)
print(custom_label_series['B':'D'])

B    1500
C    2000
D    4000
Name: Balances, dtype: int64


In [None]:
# arithmetic operations
# divide every balance by 100 (the operation is applied element-wise)
x = custom_label_series / 100
x

A    10.0
B    15.0
C    20.0
D    40.0
Name: Balances, dtype: float64

In [None]:
# filter elements
x_filtered = x[x >= 15]
x_filtered

B    15.0
C    20.0
D    40.0
Name: Balances, dtype: float64

In [None]:
# basic summary statistics
x

A    10.0
B    15.0
C    20.0
D    40.0
Name: Balances, dtype: float64

In [None]:
# mean
mean = x.mean()
print(mean)

21.25


In [None]:
# std
std = x.std()
print(std)

13.149778198382917


In [None]:
# max (stored as max_value to avoid shadowing Python's built-in max)
max_value = x.max()
print(max_value)

40.0


## Applying Functions to a Series

### Lambda Functions

* A lambda is a small anonymous function that is not bound to an identifier.
* It is similar to a user-defined function, but without a name.
* It is simple and straightforward, requiring only the argument(s) and an expression, alongside the keyword `lambda`.
* Its body is a single expression, so it fits on one line of code.

```
def func_name(parameters):
    code block
    
    return return_value
```

`func = lambda parameters: return_value`

* `lambda`: Keyword that indicates the definition of a lambda function.
* `parameters`: The input parameters that the lambda function will take.
* `return_value`: A single expression that defines the computation the lambda function performs and its return value.

In [None]:
# lets compare the two
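
One way to compare the two forms is to write the same computation both ways. The names `double_def` and `double_lambda` are chosen here for illustration only.

```python
# Regular function defined with def
def double_def(x):
    return x * 2

# Equivalent lambda function assigned to a name
double_lambda = lambda x: x * 2

# Both forms return the same result for the same input
print(double_def(3), double_lambda(3))
```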


In [4]:
# lambda function
double = lambda x: x*2
double(3)

6

In [8]:
even = lambda x: x % 2 == 0
even(4)

True

### Generate Random Numbers

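* Use the `NumPy` library to generate random numbers. `np.random.randint(low, high, size)` draws `size` integers from `low` (inclusive) up to `high` (exclusive).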


In [18]:
# generate random numbers
rand_num = np.random.randint(1, 10, 5)
rand_num

array([4, 6, 5, 2, 3])

In [None]:
# create a series
df = pd.Series(rand_num)

In [20]:
# display the first five rows of the series
df.head(5)

0    4
1    6
2    5
3    2
4    3
dtype: int64

In [None]:
# display last five rows
df.tail(5)

### Using the `apply()` Function in a Series

* `apply()` calls a function on every element of a Series, which makes it a powerful way to transform and analyze the data within the Series.
* Above we generated a Series of random numbers. Below we define a function called `square` that takes a number, squares it, and returns the result, so that it can be applied to the Series.

In [21]:
# square the series random numbers
square = lambda x: x**2

In [None]:
# use .rename to rename the series
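
A minimal sketch of applying `square` and renaming the result. To keep the output reproducible it uses the fixed values from the sample output above instead of fresh random numbers; the names `numbers` and `squared` are assumptions made for this example.

```python
import pandas as pd

# Fixed values copied from the sample random output shown earlier
numbers = pd.Series([4, 6, 5, 2, 3])

square = lambda x: x**2

# apply() calls the function once per element and returns a new Series
squared = numbers.apply(square)

# rename() with a scalar returns a copy of the Series with a new name
squared = squared.rename("squared")
print(squared)
```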

### `lambda` function with `apply()`

In [None]:
# Cube the numbers using lambda and apply


In [None]:
# rename the series
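
The two cells above could be filled in along these lines. This is a sketch that again assumes the fixed sample values; `numbers` and `cubed` are illustrative names.

```python
import pandas as pd

# Fixed values copied from the sample random output shown earlier
numbers = pd.Series([4, 6, 5, 2, 3])

# Pass the lambda straight to apply(), then rename the resulting Series
cubed = numbers.apply(lambda x: x**3).rename("cubed")
print(cubed)
```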

### Using the `map()` Function in a series

* `map()` substitutes each value in a Series with another value, which is a convenient way to transform the values in a Series. The mapping can be given as a function or as a dictionary.

In [None]:
# map our random numbers as pass or fail
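
A possible sketch of the pass/fail mapping. The pass mark of 5 is an assumption made for this example, as are the names `PASS_MARK`, `grade`, `results` and `words`; the input uses the fixed sample values shown earlier.

```python
import pandas as pd

# Fixed values copied from the sample random output shown earlier
numbers = pd.Series([4, 6, 5, 2, 3])

PASS_MARK = 5  # assumed threshold, not given in the notebook

def grade(score):
    """Label a score as 'pass' or 'fail' against the assumed pass mark."""
    return "pass" if score >= PASS_MARK else "fail"

# map() with a function: the function is called on each value
results = numbers.map(grade)

# map() with a dictionary: values without a matching key become NaN
words = numbers.map({2: "two", 3: "three"})

print(results)
print(words)
```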


### `lambda` function with `map()`

In [None]:
# use a lambda function with map() to double each number


In [None]:
# rename the series
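
A sketch of how these two cells might look, assuming the fixed sample values from earlier. The names `numbers` and `doubled` are illustrative.

```python
import pandas as pd

# Fixed values copied from the sample random output shown earlier
numbers = pd.Series([4, 6, 5, 2, 3])

# map() accepts a lambda just like apply() does
doubled = numbers.map(lambda x: x * 2).rename("doubled")
print(doubled)
```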

### `lambda` function with Conditional Statement

In [None]:
# are the random numbers even or odd


In [None]:
# rename the series
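
The even/odd check can be written with a conditional expression inside the lambda. This sketch assumes the fixed sample values from earlier; `numbers` and `parity` are illustrative names.

```python
import pandas as pd

# Fixed values copied from the sample random output shown earlier
numbers = pd.Series([4, 6, 5, 2, 3])

# A conditional expression lets a lambda choose between two results
parity = numbers.map(lambda x: "even" if x % 2 == 0 else "odd")
parity = parity.rename("parity")
print(parity)
```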

## Series to DataFrame

* If a **Series** is like a single column of a *table*, a **DataFrame** is the whole *table*: a two-dimensional structure whose columns (one or more) share the same index.

In [None]:
# lets convert all the series we created into a dataframe
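
One way to do the conversion, sketched with a few illustrative Series rather than every Series created in this notebook. The names below are assumptions for this example.

```python
import pandas as pd

# Illustrative Series built from the fixed sample values shown earlier
numbers = pd.Series([4, 6, 5, 2, 3], name="number")
squared = numbers.apply(lambda x: x**2).rename("squared")
parity = numbers.map(lambda x: "even" if x % 2 == 0 else "odd").rename("parity")

# A single Series becomes a one-column DataFrame with to_frame()
single = numbers.to_frame()

# Several Series are combined column-wise with concat(axis=1);
# each Series name becomes a column name and rows are aligned on the index
combined = pd.concat([numbers, squared, parity], axis=1)
print(combined)
```

Because `concat` aligns on index labels, Series with different indexes (for example `dict_series` and `custom_label_series`) would produce rows filled with `NaN` wherever a label is missing.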


## Knock Yourself Out!

You work as a real estate agent at *MoringaHome Realty*. To assist your clients in making informed decisions about property investment, you decide to analyze property data using Pandas.
1. Generate 120 random numbers between Ksh 4,000 and Ksh 20,000 using NumPy to represent the prices of the houses.
2. Display the first and last 7 houses.
3. Create a function that takes the price of a house and returns the category of that house, e.g. Suburb. The categories are of your own choosing.
4. Apply the function created above to the Series.
5. Apply a lambda function to increase the property prices by 10% due to the new tax laws.
6. Apply a custom function to increase the property prices by an additional Ksh 250 for garbage collection.
7. Create a new Series for each step and finally combine them all into a DataFrame named 'Moringa_property'.

In [139]:
# the upper bound of randint is exclusive, so 20001 allows a price of exactly 20000
house_prices = np.random.randint(4000, 20001, 120)

In [140]:
df = pd.Series(house_prices, name = 'prices')
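
Step 2 of the exercise (displaying the first and last 7 houses) could be sketched as below. The seeded generator and the names `rng`, `prices`, `first_seven` and `last_seven` are assumptions made so the example is reproducible; they are not part of the solution cells in this notebook.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example gives the same prices on every run
rng = np.random.default_rng(seed=0)  # seed value chosen arbitrarily

# 120 integer prices from 4000 to 20000 (the upper bound 20001 is exclusive)
prices = pd.Series(rng.integers(4000, 20001, size=120), name="prices")

first_seven = prices.head(7)  # first 7 houses
last_seven = prices.tail(7)   # last 7 houses

print(first_seven)
print(last_seven)
```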

In [141]:
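def categorizer(price):
    # categories chosen for this exercise, based on price bands
    if 4000 <= price < 8000:
        return "Slums"
    elif 8000 <= price < 16000:
        return "Suburbs"
    elif price >= 16000:  # >= so that a price of exactly 16000 is categorized
        return "Upscale"
    else:
        return "Price provided is not valid"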

In [None]:
# apply the categorizer to every price and name the resulting Series
category_df = df.apply(categorizer).rename('category')

Apply a lambda function to increase the property prices by 10% due to the new tax laws.

In [143]:
house_tax = lambda x: x * 1.1

In [None]:
# apply the 10% increase to every price and name the resulting Series
new_prices = df.apply(house_tax).rename('price and tax')


Apply a custom function to increase the property prices by an additional Ksh 250 for garbage collection.

In [145]:
def garbage_fees(n):
  return n + 250

In [146]:
full_price = new_prices.apply(garbage_fees)


In [None]:
full_price.rename('price and garbage', inplace=True)

Create a new Series for each step and finally combine them all into a DataFrame named 'Moringa_property'.

In [148]:
moringa_property = pd.concat([df, new_prices, full_price, category_df], axis = 1)