<h1 style="background-color:#fbb714;font-family:courier;font-size:350%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> Data Manipulation: NumPy & Pandas Tutorial  </h1>

# Content:

1. [NumPy](#1)
    * [Introduction](#2)
    * [What is an Array?](#3)
    * [Creating Array](#4)
    * [Array Features](#5)
    * [ReShape](#6)
    * [Concatenation](#7)
    * [Splitting](#8)
    * [Sorting](#9)
    * [Index Operations](#10)
    * [Subsets](#11)
    * [Conditional Element Operations](#12)
    * [Mathematical Operations](#13)
1. [Pandas](#14)
    * [Introduction](#15)
    * [Creating Series](#16)
    * [Creating DataFrame](#17)
    * [DataFrame Operations](#18)
    * [loc - iloc](#19)
    * [Conditional Element Operations](#20)
    * [Merging](#21)
    * [Aggregation & Grouping](#22)
    * [Pivot Table](#23)
    * [Reading Data](#24)
    * [Rule Based Classification](#25)

<a id = "1"></a><br>
<h1 style="background-color:#3d85c6;font-family:courier;font-size:300%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> NumPy </h1>

<a id = "2"></a><br>
<h1 style="background-color:#98AFC7;font-family:courier;font-size:250%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> Introduction </h1>

NumPy (Numerical Python) is a Python library for scientific computing and data processing. Its core data structure, the ndarray (N-dimensional array), stores multi-dimensional data efficiently, and the library supplies fast mathematical functions that operate on it. Consequently, NumPy is a preferred tool for scientific and engineering applications dealing with large amounts of data.

The key features of NumPy are as follows:

* **NumPy Array:** NumPy uses the ndarray as its fundamental data structure for multi-dimensional arrays. For numerical data, these arrays offer faster and more compact storage and processing than Python's built-in lists. NumPy arrays can represent vectors, matrices, and higher-dimensional data.

* **Broadcasting:** NumPy's broadcasting feature enables mathematical operations on arrays with different but compatible shapes by treating the smaller array as if it were stretched to match the larger one. This removes the need for explicit loops or manual copying.

* **Fast and Efficient:** NumPy performs high-performance mathematical operations using low-level code written in C. This leads to improved performance in applications dealing with large datasets and complex mathematical calculations.

* **Mathematical Functions:** NumPy provides a comprehensive mathematical library containing functions for mathematics, statistics, random number generation, linear algebra, and basic numerical differentiation and integration.

* **Data Array Operations:** NumPy offers rich functions for data processing and manipulation. You can perform operations such as filtering, sorting, reshaping, rotation, and merging on data arrays.

NumPy is widely used in various fields, from scientific computations to data analysis, machine learning, and image processing. It is a fundamental tool for most data scientists and researchers, providing them with powerful and efficient capabilities for data manipulation and analysis.
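
The broadcasting feature listed above can be illustrated with a small sketch. The array names and values here are illustrative only, not taken from a real dataset:

```python
import numpy as np

# A (3, 1) column and a (4,) row broadcast to a common (3, 4) shape.
column = np.array([[0], [10], [20]])
row = np.array([1, 2, 3, 4])
table = column + row

# A scalar broadcasts against every element of an array.
scaled = row * 2

print(table.shape)  # (3, 4)
print(table)
print(scaled)       # [2 4 6 8]
```

Neither input is copied or resized by hand; NumPy works out the common shape from the two operand shapes.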

<a id = "3"></a><br>
<h1 style="background-color:#98AFC7;font-family:courier;font-size:250%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> What is an Array? </h1>

An array is a data structure that stores a collection of elements, all of the same data type, in a contiguous block of memory. Arrays are used to represent multiple values or data items under a single name. Each element in an array is identified by an index, which is an integer value that specifies the element's position within the array.

In programming, arrays are essential for efficiently storing and accessing a group of elements of the same type. They are widely used in various programming languages to handle collections of data, perform mathematical computations, and organize information in a structured manner.

* In Python, two array-like containers are commonly used:

    1. Lists: Lists are a built-in Python data type that can hold elements of different data types. They are flexible and can be resized dynamically, making them suitable for handling collections of heterogeneous data.

    2. NumPy Arrays: NumPy arrays, also known as ndarray (N-dimensional array), are a data structure provided by the NumPy library. Unlike Python lists, NumPy arrays are homogeneous and can efficiently handle large sets of numerical data. They offer faster array operations and are widely used for scientific computing, data analysis, and numerical computations.

* Arrays have many advantages, including:

    * Efficient memory usage: Arrays store elements in contiguous memory locations, reducing memory overhead and enabling fast access to elements.

    * Fast element access: Elements in an array can be accessed directly using their index, making array access faster than searching through elements in other data structures.
     
    * Efficient mathematical operations: Arrays allow vectorized operations, enabling mathematical operations to be performed on entire arrays at once, which improves performance for numerical computations.
     
    * Simplified data manipulation: Arrays support advanced indexing and slicing techniques, making it easy to extract subsets of data or perform operations on specific elements.
     
    * Better performance: The ability to perform operations on multiple elements simultaneously and the efficient memory layout of arrays contribute to improved overall performance.

In summary, an array is a data structure used to store a collection of elements of the same data type in a contiguous block of memory. It offers advantages such as efficient memory usage, fast element access, and support for mathematical operations, making it a fundamental tool in programming and data manipulation tasks.
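
As a rough illustration of the difference between a heterogeneous list and a homogeneous NumPy array, consider the sketch below. The variable names are illustrative only:

```python
import numpy as np

mixed_list = [1, 2.5, 3]         # a list keeps each element's own type
as_array = np.array(mixed_list)  # NumPy converts everything to one dtype

list_types = [type(v).__name__ for v in mixed_list]
doubled_list = mixed_list * 2    # list repetition, not arithmetic
doubled_array = as_array * 2     # element-wise (vectorized) multiplication

print(list_types)      # ['int', 'float', 'int']
print(as_array.dtype)  # float64
print(doubled_list)    # [1, 2.5, 3, 1, 2.5, 3]
print(doubled_array)   # [2. 5. 6.]
```

The same `* 2` expression means repetition for a list and arithmetic for an array, which is one reason numerical code usually converts lists to arrays first.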

<a id = "4"></a><br>
<h1 style="background-color:#98AFC7;font-family:courier;font-size:250%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> Creating Array </h1>

To create a NumPy array, you can use the numpy.array() function or other NumPy functions that generate specific arrays. 

Some common methods to create NumPy arrays:

* Using **numpy.array()** function: You can create a NumPy array by passing a Python list or tuple to the numpy.array() function.

In [1]:
import numpy as np

# From a list:

data_list = [1, 2, 3, 4, 5]
numpy_array = np.array(data_list)

print(numpy_array)

[1 2 3 4 5]


In [2]:
# From a tuple:

data_tuple = (6, 7, 8, 9, 10)
numpy_array = np.array(data_tuple)

print(numpy_array)

[ 6  7  8  9 10]


* Using NumPy functions: NumPy provides functions like **numpy.zeros(), numpy.ones(), numpy.arange(),** etc., to create arrays with specific values.

In [3]:
# Create an array of zeros:

zeros_array = np.zeros(5)

print(zeros_array)

[0. 0. 0. 0. 0.]


In [4]:
# Create an array of ones:

ones_array = np.ones(3)

print(ones_array)

[1. 1. 1.]


In [5]:
# Create a range of values:

range_array = np.arange(1, 6)

print(range_array)

[1 2 3 4 5]


* Using random number generation: NumPy provides functions to create arrays with random values using the **numpy.random** module.

In [6]:
# Create an array of random values in the half-open interval [0, 1):

random_array = np.random.random(6)

print(random_array)

[0.21659563 0.89628898 0.33528761 0.10891849 0.89372779 0.5340989 ]


* Using **randint()**: The randint() function in the numpy.random module creates a NumPy array of random integers. The lower bound is inclusive and the upper bound is exclusive, so randint(1, 11) draws integers from 1 to 10.

In [7]:
# Creating a 1-dimensional array of random integers between 1 and 10:

random_int_array = np.random.randint(1, 11, size=5)

print(random_int_array)

[8 8 7 7 2]


In [8]:
# Creating a 2-dimensional array of random integers between 1 and 100 (3x3):

random_matrix = np.random.randint(1, 101, size=(3, 3))

print(random_matrix)

[[14 90 78]
 [85  8 98]
 [ 7  4 66]]


* Using other NumPy functions: NumPy provides functions like **numpy.linspace(), numpy.full(),** etc., to create arrays with specific properties.

In [9]:
# Create an array with equally spaced values:

linspace_array = np.linspace(0, 1, num=5)

print(linspace_array)

[0.   0.25 0.5  0.75 1.  ]


In [10]:
# Create a 2x2 array with a specific value (e.g., 5):

full_array = np.full((2, 2), 5)

print(full_array)

[[5 5]
 [5 5]]
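
The random examples above use the legacy `numpy.random` functions, which give different values on every run. When reproducible output matters, one option is a seeded `Generator`. This is a minimal sketch; the seed value 42 is arbitrary, and `np.eye()` is included only as one more creation function:

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, chosen for illustration

random_floats = rng.random(3)              # floats in [0, 1)
random_ints = rng.integers(1, 11, size=5)  # integers from 1 to 10 (upper bound excluded)
identity = np.eye(3)                       # 3x3 identity matrix of floats

print(random_floats)
print(random_ints)
print(identity)
```

Re-creating the generator with the same seed reproduces the same sequence of values.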


<a id = "5"></a><br>
<h1 style="background-color:#98AFC7;font-family:courier;font-size:250%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> Array Features </h1>

* **ndim:** an attribute (not a method, so it is accessed without parentheses) that holds the number of dimensions (axes) of a NumPy array as an integer.

* **shape:** an attribute that holds a tuple of integers giving the length of the array along each axis.

* **size:** an attribute that holds the total number of elements in the array as an integer. It equals the product of the values in shape.

* **dtype:** an attribute that holds a NumPy data type object describing the type of the array's elements. NumPy data types include integers, floating-point numbers, complex numbers, and other specialized types.


---

In [11]:
import numpy as np

# 1D NumPy array:

arr_1d = np.array([1, 2, 3, 4, 5])

print(arr_1d.ndim)

1


In [12]:
# 2D NumPy array (matrix):

arr_2d = np.array([[1, 2, 3], [4, 5, 6]])

print(arr_2d.ndim)

2


In [13]:
# 3D NumPy array (3-dimensional tensor):

arr_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

print(arr_3d.ndim)

3


In [14]:
import numpy as np

# 1D NumPy array:

arr_1d = np.array([1, 2, 3, 4, 5])

print(arr_1d.shape)

(5,)


In [15]:
# 2D NumPy array (matrix):

arr_2d = np.array([[1, 2, 3], [4, 5, 6]])

print(arr_2d.shape)

(2, 3)


In [16]:
# 3D NumPy array (3-dimensional tensor):

arr_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

print(arr_3d.shape)

(2, 2, 2)


* This indicates that the arr_3d array is three-dimensional and has length 2 along each of its three axes. So, this array contains a total of 2 x 2 x 2 = 8 elements.

In [17]:
# Creating a 1D NumPy array:

arr_1d = np.array([1, 2, 3, 4, 5])

print(arr_1d.size)

5


In [18]:
# Creating a 2D NumPy array:

arr_2d = np.array([[1, 2, 3], [4, 5, 6]])

print(arr_2d.size)

6


In [19]:
# Creating a 3D NumPy array:

arr_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

print(arr_3d.size)

8


In [20]:
# Creating a NumPy array with integers:

arr_int = np.array([1, 2, 3, 4, 5])

print(arr_int.dtype)

int64


In [21]:
# Creating a NumPy array with floating-point numbers:

arr_float = np.array([1.1, 2.2, 3.3])

print(arr_float.dtype)

float64


In [22]:
# Creating a NumPy array with complex numbers:

arr_complex = np.array([1 + 2j, 3 + 4j])

print(arr_complex.dtype)

complex128


* The data type of the elements in the array is a 128-bit complex number (two 64-bit floats, one for the real part and one for the imaginary part).
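
The dtype can also be chosen explicitly or converted afterwards. The sketch below assumes nothing beyond NumPy itself; note that the default integer dtype shown earlier (int64) can differ by platform and NumPy version:

```python
import numpy as np

arr_small = np.array([1, 2, 3], dtype=np.int32)  # request a dtype explicitly
arr_as_float = arr_small.astype(np.float64)      # astype returns a converted copy

print(arr_small.dtype, arr_small.itemsize)        # int32 4
print(arr_as_float.dtype, arr_as_float.itemsize)  # float64 8
print(arr_small.nbytes)                           # 12 (size * itemsize)
```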

<a id = "6"></a><br>
<h1 style="background-color:#98AFC7;font-family:courier;font-size:250%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> ReShape </h1>

In NumPy, the **reshape()** method changes the shape of an array without modifying its data. It reorganizes the same elements into a new shape and returns the result, leaving the shape of the original array unchanged.

The **reshape()** method takes the new shape as a tuple (or as separate integer arguments), where each value gives the length along one axis. The total number of elements in the new shape must equal the number of elements in the original array; otherwise, a **ValueError** is raised.

---

In [23]:
import numpy as np

# Creating a 1D NumPy array with 12 elements:

arr_1d = np.arange(1, 13)

print(arr_1d)

[ 1  2  3  4  5  6  7  8  9 10 11 12]


In [24]:
# Reshaping the 1D array into a 3x4 2D array:

arr_2d = arr_1d.reshape((3, 4))

print(arr_2d)

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]


In [25]:
# Creating a 1D NumPy array with 20 elements:

arr_1d_2 = np.arange(1, 21)

print(arr_1d_2)

[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]


In [26]:
# Reshaping the 1D array into a 4x5 2D array:

arr_2d_2 = arr_1d_2.reshape((4, 5))

print(arr_2d_2)

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]
 [16 17 18 19 20]]


* In these examples, we created 1D NumPy arrays using **np.arange()** and then used **reshape()** method to transform them into 2D arrays of different shapes. The total number of elements in each array remains the same; only the organization of the elements changes based on the new shape specified in **reshape()** method.
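
Two further behaviours of reshape() are worth a quick sketch: passing -1 lets NumPy infer one dimension, and an incompatible shape raises a ValueError. Variable names are illustrative:

```python
import numpy as np

arr = np.arange(1, 13)           # 12 elements

inferred = arr.reshape((3, -1))  # -1 asks NumPy to infer the missing length (4)
flat = inferred.reshape(-1)      # back to one dimension

try:
    arr.reshape((5, 3))          # 5 * 3 = 15 does not match 12 elements
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(inferred.shape)  # (3, 4)
print(flat.shape)      # (12,)
print(error_message)
```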

<a id = "7"></a><br>
<h1 style="background-color:#98AFC7;font-family:courier;font-size:250%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> Concatenation </h1>

In NumPy, the concatenate() function is used to concatenate arrays along a specified axis. Concatenation means combining multiple arrays to form a single array.

NumPy provides several functions to concatenate arrays, including **numpy.concatenate(), numpy.stack(), numpy.hstack(), and numpy.vstack()**. Each of these functions has different behavior when it comes to the shape and dimensions of the input arrays.

* **numpy.concatenate():** This function joins arrays along an existing axis. It takes a sequence of arrays as input and returns a new array. The input arrays must have the same shape in every dimension except the one along which they are joined.

* **numpy.stack():** This function is used to stack arrays along a new axis. It takes a sequence of arrays as input and returns a new array with an additional dimension. This is useful when you want to combine arrays along a new axis.

* **numpy.hstack():** This function is used to stack arrays horizontally (along columns). It takes a sequence of arrays as input and returns a new array with the same number of rows but with the columns of the input arrays concatenated.

* **numpy.vstack():** This function is used to stack arrays vertically (along rows). It takes a sequence of arrays as input and returns a new array with the same number of columns but with the rows of the input arrays concatenated.

---

<div style="border-radius:10px; border:#DEB887 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">

<h3 align="left"><font color='#DEB887'>💡 Notes: </font></h3>

* Detailed explanation of numpy.concatenate() and the axis parameter:

    1. **numpy.concatenate():** This function is used to concatenate two or more arrays into a single array. It takes a sequence of arrays as input and returns a new array that results from joining them. The arrays must have the same shape in every dimension except the one along which they are being concatenated.

    2. **axis parameter:** The axis parameter is an optional argument of the numpy.concatenate() function that specifies the axis along which the arrays are joined. It defaults to 0. For two-dimensional arrays:

        * **axis=0:** Specifies vertical concatenation (along rows). It stacks arrays on top of each other.
        * **axis=1:** Specifies horizontal concatenation (along columns). It concatenates arrays side by side.

---

In [27]:
# Example arrays:

arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6]])

In [28]:
# Vertical concatenation (along rows) using axis=0:

result_vertical = np.concatenate((arr1, arr2), axis=0)

print("Vertical concatenation:")

print(result_vertical)

Vertical concatenation:
[[1 2]
 [3 4]
 [5 6]]


In [29]:
# Horizontal concatenation (along columns) using axis=1:

arr3 = np.array([[7], [8]])
result_horizontal = np.concatenate((arr1, arr3), axis=1)

print("\nHorizontal concatenation:")

print(result_horizontal)


Horizontal concatenation:
[[1 2 7]
 [3 4 8]]


In [30]:
# Concatenate two 1D arrays:

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
result = np.concatenate((arr1, arr2))

print(result)

[1 2 3 4 5 6]


In [31]:
# Stack two 1D arrays along a new axis:

result = np.stack((arr1, arr2))

print(result)

[[1 2 3]
 [4 5 6]]


In [32]:
# Stack two 1D arrays horizontally:

result = np.hstack((arr1, arr2))

print(result)

[1 2 3 4 5 6]


In [33]:
# Stack two 1D arrays vertically:

result = np.vstack((arr1, arr2))

print(result)

[[1 2 3]
 [4 5 6]]
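
As a small sketch of two points made above, the block below shows how the axis argument of np.stack() changes the result, and what happens when shapes do not line up for np.concatenate(). The array names are illustrative:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

stacked_rows = np.stack((a, b), axis=0)  # shape (2, 3): inputs become rows
stacked_cols = np.stack((a, b), axis=1)  # shape (3, 2): inputs become columns

m = np.array([[1, 2], [3, 4]])           # shape (2, 2)
wide = np.array([[5, 6, 7]])             # shape (1, 3)
try:
    np.concatenate((m, wide), axis=0)    # column counts differ (2 vs 3)
    concat_error = None
except ValueError as exc:
    concat_error = str(exc)

print(stacked_rows.shape, stacked_cols.shape)  # (2, 3) (3, 2)
print(stacked_cols)
print(concat_error)
```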


<a id = "8"></a><br>
<h1 style="background-color:#98AFC7;font-family:courier;font-size:250%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> Splitting </h1>

In NumPy, splitting refers to the process of breaking a single array into multiple smaller arrays along a specified axis. It allows you to partition an array into smaller chunks based on a specific criterion. NumPy provides several functions for splitting arrays, including **numpy.split(), numpy.array_split(), numpy.hsplit(), and numpy.vsplit()**.

* **numpy.split():** This function splits an array into multiple sub-arrays along a specified axis. It takes three arguments: the array to be split, either the number of equally sized sub-arrays to create or a sorted list of indices at which to split, and the axis along which the split occurs (0 by default). When an integer is given, the array's length along that axis must be evenly divisible by it; otherwise, a ValueError is raised.

* **numpy.array_split():** This function is similar to numpy.split(), but it allows you to specify the number of sub-arrays explicitly without requiring that the array size be evenly divisible by that number. The sub-arrays may have different sizes.

* **numpy.hsplit():** This function splits an array horizontally (along columns) into multiple sub-arrays. It takes two arguments: the array to be split and the number of equally-sized sub-arrays to create.

* **numpy.vsplit():** This function splits an array vertically (along rows) into multiple sub-arrays. It takes two arguments: the array to be split and the number of equally-sized sub-arrays to create.

---

In [34]:
import numpy as np

# Example array:

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Using numpy.split():

result_split = np.split(arr, 5)

print("numpy.split() result:")

print(result_split)

numpy.split() result:
[array([1, 2]), array([3, 4]), array([5, 6]), array([7, 8]), array([ 9, 10])]


In [35]:
# Using numpy.array_split():

result_array_split = np.array_split(arr, 3)

print("\nnumpy.array_split() result:")

print(result_array_split)


numpy.array_split() result:
[array([1, 2, 3, 4]), array([5, 6, 7]), array([ 8,  9, 10])]


In [36]:
# Example 2D array:

arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# Using numpy.hsplit():

result_hsplit = np.hsplit(arr_2d, 3)

print("\nnumpy.hsplit() result:")

print(result_hsplit)


numpy.hsplit() result:
[array([[ 1],
       [ 4],
       [ 7],
       [10]]), array([[ 2],
       [ 5],
       [ 8],
       [11]]), array([[ 3],
       [ 6],
       [ 9],
       [12]])]


In [37]:
# Using numpy.vsplit():

result_vsplit = np.vsplit(arr_2d, 2)

print("\nnumpy.vsplit() result:")

print(result_vsplit)


numpy.vsplit() result:
[array([[1, 2, 3],
       [4, 5, 6]]), array([[ 7,  8,  9],
       [10, 11, 12]])]
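
np.split() also accepts a list of split positions instead of a count, and it raises an error when an integer count does not divide the array evenly. A minimal sketch, with illustrative names:

```python
import numpy as np

arr = np.arange(1, 11)         # [1, 2, ..., 10]

# Split at positions 3 and 7: arr[:3], arr[3:7], arr[7:]
parts = np.split(arr, [3, 7])

try:
    np.split(arr, 3)           # 10 elements cannot form 3 equal parts
    split_error = None
except ValueError as exc:
    split_error = str(exc)

print([p.tolist() for p in parts])  # [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]
print(split_error)
```

When unequal parts are acceptable, np.array_split() handles the same call without raising.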


<a id = "9"></a><br>
<h1 style="background-color:#98AFC7;font-family:courier;font-size:250%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> Sorting </h1>

In NumPy, sorting refers to the process of arranging the elements of an array in a particular order, either in ascending or descending order. Sorting arrays is a common operation in data analysis and processing tasks. NumPy provides several functions to perform sorting operations on arrays.

The main sorting functions in NumPy are:

* **numpy.sort():** This function returns a sorted copy of an array. By default, it sorts the array in ascending order along the last axis.

* **numpy.argsort()**: This function returns the indices that would sort an array. Instead of returning the sorted array, it returns an array of indices that would sort the original array.

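* **numpy.lexsort():** This function performs an indirect sort using a sequence of keys and returns the sorting indices. The last key in the sequence is the primary sort key, the second-to-last is the secondary key, and so on.

* **numpy.msort():** This function returns a copy of an array sorted along the first axis, equivalent to numpy.sort(a, axis=0). It was deprecated in NumPy 1.24 and removed in NumPy 2.0, so new code should call numpy.sort() directly; pass kind='stable' when equal elements must keep their relative order.

* **numpy.sort_complex():** This function sorts complex numbers by their real parts first and then by their imaginary parts.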

---

In [38]:
import numpy as np

# Example array:

arr = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5])

# Using numpy.sort():

sorted_arr = np.sort(arr)

print("Sorted array:")

print(sorted_arr)

Sorted array:
[1 1 2 3 4 5 5 6 9]


In [39]:
# Using numpy.argsort():

sorted_indices = np.argsort(arr)

print("\nSorted indices:")
print(sorted_indices)


Sorted indices:
[1 3 6 0 2 4 8 7 5]


In [40]:
# Using numpy.lexsort():

names = np.array(['Alice', 'Bob', 'Cathy', 'David', 'Eva'])
ages = np.array([25, 22, 29, 32, 28])
sorted_indices_lex = np.lexsort((names, ages))

print("\nSorted indices using lexsort:")

print(sorted_indices_lex)


Sorted indices using lexsort:
[1 0 4 2 3]


In [41]:
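# numpy.msort() was removed in NumPy 2.0; np.sort(a, axis=0) is its replacement.
# kind="stable" additionally requests a stable sort:

arr2 = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5])
stable_sorted_arr = np.sort(arr2, axis=0, kind="stable")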

print("\nStable sorted array:")

print(stable_sorted_arr)


Stable sorted array:
[1 1 2 3 4 5 5 6 9]
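
A few common sorting patterns that the functions above make possible are sketched below: descending order, reordering one array by another with argsort(), and sorting complex numbers. The data is made up for illustration:

```python
import numpy as np

scores = np.array([3, 1, 4, 1, 5])
names = np.array(["a", "b", "c", "d", "e"])

descending = np.sort(scores)[::-1]         # sort ascending, then reverse
order = np.argsort(scores, kind="stable")  # ties keep their original order
names_by_score = names[order]              # reorder names by score

z = np.array([2 + 1j, 1 + 5j, 1 + 2j])
z_sorted = np.sort_complex(z)              # real part first, then imaginary part

print(descending)      # [5 4 3 1 1]
print(names_by_score)  # ['b' 'd' 'a' 'c' 'e']
print(z_sorted)
```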


<a id = "10"></a><br>
<h1 style="background-color:#98AFC7;font-family:courier;font-size:250%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> Index Operations </h1>

In NumPy, there are various index methods and operations to access specific elements or subarrays of NumPy arrays. Here are some commonly used index methods and operations in NumPy:

* **Integer Indexing:** You can use integers as indices to access specific elements of the array.

* **Slicing:** Slicing allows you to extract a specific portion (subarray) of an array by specifying a range of indices.

* **Boolean Indexing:** Boolean indexing allows you to filter elements from an array based on a specified condition using a boolean mask.

* **Fancy Indexing:** Fancy indexing allows you to access specific elements or subarrays by providing an array of indices.

* **Two-Dimensional Indexing:** In two-dimensional arrays, you can access specific elements or subarrays using row and column indices.

* **Changing Indices:** You can rearrange the elements of an array by changing their indices.

Below are some examples of using each index method:

---

In [42]:
import numpy as np

# Example array:

arr = np.array([10, 20, 30, 40, 50])

# Integer Indexing:

arr[2]

30

In [43]:
# Slicing:

arr[1:4]

array([20, 30, 40])

In [44]:
# Boolean Indexing:

mask = arr > 30
arr[mask]

array([40, 50])

In [45]:
# Fancy Indexing:

indices = [0, 2, 4]
arr[indices]

array([10, 30, 50])

In [46]:
# Two-Dimensional Indexing:

arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr_2d[1, 2]

6

In [47]:
# Changing Indices:

arr[1], arr[3] = arr[3], arr[1]
arr

array([10, 40, 30, 20, 50])

In [48]:
import numpy as np

# Example 2D array:

arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

arr_2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [49]:
# Selecting a single element:

arr_2d[1, 2]

6

In [50]:
# Selecting a row:

arr_2d[1, :]

array([4, 5, 6])

In [51]:
# Selecting multiple rows:

arr_2d[[0, 2], :]

array([[1, 2, 3],
       [7, 8, 9]])

In [52]:
# Selecting a column:

arr_2d[:, 0]

array([1, 4, 7])

In [53]:
# Selecting multiple columns:

arr_2d[:, [0, 2]]

array([[1, 3],
       [4, 6],
       [7, 9]])

In [54]:
# Selecting a subarray:

arr_2d[0:2, 1:3]

array([[2, 3],
       [5, 6]])
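
The indexing forms above combine with negative indices, slice steps, and multi-dimensional masks. A short sketch, using illustrative names:

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

last = arr[-1]            # negative indices count from the end
every_other = arr[::2]    # a step of 2 takes every second element
reversed_arr = arr[::-1]  # a step of -1 reverses the array

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
even_values = grid[grid % 2 == 0]  # a boolean mask on a 2D array gives a 1D result
pairs = grid[[0, 2], [1, 0]]       # picks elements (0, 1) and (2, 0)

print(last)          # 50
print(every_other)   # [10 30 50]
print(reversed_arr)  # [50 40 30 20 10]
print(even_values)   # [2 4 6 8]
print(pairs)         # [2 7]
```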

<a id = "11"></a><br>
<h1 style="background-color:#98AFC7;font-family:courier;font-size:250%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> Subsets </h1>

To perform operations on a subset of a NumPy array, select the subset and then apply NumPy operations to it directly. Basic slicing returns a view of the original data rather than a copy, so selecting a subset this way is cheap even for large arrays.

To work with subsets, you first need to select the subset, and then apply the desired operation on that subset. NumPy subsets can be selected using methods such as slicing or boolean indexing.

Here's an example of accessing a subset of a NumPy array and performing some operations:

---

In [55]:
# Example array:

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

arr

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [56]:
# Selecting a subset (using slicing):

subset = arr[2:6]

subset

array([3, 4, 5, 6])

In [57]:
# Performing an operation on the subset:

mean_value = np.mean(subset)

mean_value

4.5

In [58]:
# Assigning values to the subset:

arr[2:6] = 10 

arr

array([ 1,  2, 10, 10, 10, 10,  7,  8,  9, 10])

* A subset selected by slicing is a view: it shares memory with the original array, so changing the subset also changes the original. The **copy()** method creates an independent copy of the subset, which lets you modify the copy without affecting the original array.

In [59]:
# Example array:

arr = np.array([1, 2, 3, 4, 5])

arr

array([1, 2, 3, 4, 5])

In [60]:
# Selecting a subset using slicing:

subset = arr[1:4]

subset

array([2, 3, 4])

In [61]:
# Creating a copy of the subset:

subset_copy = subset.copy()

subset_copy

array([2, 3, 4])

In [62]:
# Modifying the subset and its copy:

subset[0] = 10
subset_copy[1] = 20

subset_copy

array([ 2, 20,  4])
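
The effect on the original array is easy to miss in the cells above, so here is a self-contained sketch that prints it. `np.shares_memory()` is used only to confirm which objects share data:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

view = arr[1:4]                # basic slicing returns a view
independent = arr[1:4].copy()  # copy() allocates separate memory

view[0] = 10                   # also changes arr[1]
independent[1] = 20            # leaves arr untouched

print(arr)          # [ 1 10  3  4  5]
print(independent)  # [ 2 20  4]
print(np.shares_memory(arr, view), np.shares_memory(arr, independent))  # True False
```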

<a id = "12"></a><br>
<h1 style="background-color:#98AFC7;font-family:courier;font-size:250%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> Conditional Element Operations </h1>

NumPy provides several powerful functions and methods for conditional element operations. These operations can be used to filter or modify elements in a NumPy array based on certain conditions. Conditional element operations are commonly used for data analysis and transformations on NumPy arrays.

* **Boolean Indexing:** It is used to filter elements in an array using boolean arrays.

* **np.where():** It is used to select elements that satisfy a particular condition or replace elements with another value.

* **np.logical_and() and np.logical_or():** These functions are used to combine multiple conditions.

* **np.any() and np.all():** np.any() returns True if at least one element of a boolean array is True, and np.all() returns True only if every element is True.

---

In [63]:
# Example array:

arr = np.array([1, 2, 3, 4, 5])

arr

array([1, 2, 3, 4, 5])

In [64]:
# Boolean Indexing for filtering:

arr[arr > 3]

array([4, 5])

In [65]:
# np.where() for value replacement:

arr2 = np.where(arr > 3, 10, arr)

arr2

array([ 1,  2,  3, 10, 10])

In [66]:
# Combining multiple conditions with np.logical_and() and np.logical_or():

mask1 = arr > 2
mask2 = arr < 5

arr[np.logical_and(mask1, mask2)]

array([3, 4])

In [67]:
# Conditional evaluation with np.any() and np.all():

np.any(arr > 3)

True

In [68]:
np.all(arr > 3)

False
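
Conditions can also be combined with the element-wise operators `&`, `|`, and `~`, which behave like np.logical_and(), np.logical_or(), and np.logical_not() on boolean arrays. The sketch below also shows np.where() called with only a condition, which returns positions instead of values:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Parentheses are required: & and | bind more tightly than comparisons.
between = arr[(arr > 2) & (arr < 5)]
outside = arr[(arr < 2) | (arr > 4)]
not_three = arr[~(arr == 3)]

# With a single argument, np.where returns a tuple of index arrays.
positions = np.where(arr > 3)[0]

print(between)    # [3 4]
print(outside)    # [1 5]
print(not_three)  # [1 2 4 5]
print(positions)  # [3 4]
```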

<a id = "13"></a><br>
<h1 style="background-color:#98AFC7;font-family:courier;font-size:250%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> Mathematical Operations </h1>

In Python, NumPy provides a wide range of functions and methods for mathematical operations. NumPy offers fast and efficient tools for mathematical computations and is commonly used in scientific calculations.

* **Basic Arithmetic Operations:** You can perform basic arithmetic operations such as addition, subtraction, multiplication, and division.

* **Matrix Operations:** NumPy allows matrix addition, matrix multiplication, and finding the inverse of a matrix, among other matrix operations.

* **Trigonometric Operations:** You can calculate trigonometric functions such as sine, cosine, and tangent using NumPy trigonometric functions.

* **Logarithmic and Exponential Operations:** NumPy provides functions for logarithmic and exponential operations.

* **Statistical Operations:** You can calculate statistics like mean, standard deviation, and median using NumPy statistical functions.

* **Distance and Norm Operations:** NumPy allows you to calculate the distance between two points and the norm of a vector.

* **Random Number Generation:** NumPy provides functions to generate random numbers.

* **Linear Algebra Operations:** You can perform linear algebra operations such as vector dot product, matrix multiplication, and finding the determinant using NumPy.

---

In [69]:
# Basic arithmetic operations:

x = 5
y = 3

In [70]:
np.add(x, y)

8

In [71]:
np.subtract(x, y)

2

In [72]:
np.multiply(x, y)

15

In [73]:
np.divide(x, y)

1.6666666666666667

In [74]:
# Trigonometric operations:

angle = np.pi / 4

In [75]:
np.sin(angle)

0.7071067811865475

In [76]:
np.cos(angle)

0.7071067811865476

In [77]:
np.tan(angle)

0.9999999999999999

In [78]:
# Logarithmic and exponential operations:

num = 10

In [79]:
np.log(num)

2.302585092994046

In [80]:
np.exp(num)

22026.465794806718

In [81]:
# Statistical operations:

data = np.array([2, 5, 7, 3, 9])

In [82]:
np.mean(data)

5.2

In [83]:
np.std(data)

2.5612496949731396

In [84]:
np.median(data)

5.0

In [85]:
# Random number generation:

np.random.rand()

0.2719856652941176

In [86]:
np.random.randint(1, 10)

6

In [87]:
# Linear algebra operations:

vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])

np.dot(vector1, vector2)

32

In [88]:
matrix1 = np.array([[1, 2], [3, 4]])
matrix2 = np.array([[5, 6], [7, 8]])

np.matmul(matrix1, matrix2)

array([[19, 22],
       [43, 50]])

In [89]:
np.linalg.det(matrix1)

-2.0000000000000004

In [90]:
# Solving a system of two linear equations (5x + y = 12, x + 3y = 10) with NumPy:

a = np.array([[5, 1], [1, 3]])
b = np.array([12, 10])

np.linalg.solve(a, b)

array([1.85714286, 2.71428571])
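
The matrix inverse and the distance and norm operations listed at the start of this section are not shown in the cells above. The following sketch covers them and also checks the solution of the linear system; the points `p` and `q` are made-up values:

```python
import numpy as np

a = np.array([[5, 1], [1, 3]])
b = np.array([12, 10])

x = np.linalg.solve(a, b)
residual_ok = np.allclose(a @ x, b)              # the solution satisfies a @ x = b

a_inv = np.linalg.inv(a)                         # matrix inverse
identity_ok = np.allclose(a @ a_inv, np.eye(2))  # a @ a_inv is the identity

p = np.array([1, 2])
q = np.array([4, 6])
distance = np.linalg.norm(q - p)                 # Euclidean distance between p and q

print(residual_ok, identity_ok)  # True True
print(distance)                  # 5.0
```

np.allclose() is used instead of `==` because floating-point results are rarely exact.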

<a id = "14"></a><br>
<h1 style="background-color:#F75D59;font-family:courier;font-size:300%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> Pandas </h1>

<a id = "15"></a><br>
<h1 style="background-color:#77BFC7;font-family:courier;font-size:250%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> Introduction </h1>

Pandas is an open-source library in the Python programming language used for data analysis and data manipulation. It provides powerful and flexible data structures and data analysis tools, allowing you to process and analyze your data effectively. Pandas is specifically designed for high-performance and easy operations on table-like data structures called DataFrames.

Pandas has the following core structures:

* **DataFrame:** It is a two-dimensional data structure used to store and manipulate data in a tabular format. DataFrame represents a table with columns and rows, and it allows you to organize and analyze data in a labeled format.

* **Series:** It is a one-dimensional labeled array: a sequence of values, each paired with an index label. Each column of a DataFrame is a Series, and selecting a single row also returns a Series.
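
The relationship between the two structures can be sketched in a few lines. The DataFrame below is made up for illustration, and `df_demo`, `age_column`, and `first_row` are hypothetical names.

```python
import pandas as pd

# A small illustrative DataFrame (the values are made up for this sketch)
df_demo = pd.DataFrame({'Name': ['John', 'Jane'], 'Age': [25, 30]})

age_column = df_demo['Age']   # selecting one column gives a Series
first_row = df_demo.iloc[0]   # selecting one row also gives a Series

print(type(df_demo).__name__)
print(type(age_column).__name__)
print(type(first_row).__name__)
```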

Pandas can be used for various operations, such as data loading, filtering, transformation, merging, grouping, and statistical computations. It also provides extensive support for reading and writing data in different formats (CSV, Excel, SQL, etc.).

Pandas is widely used by data scientists, analysts, and data engineers. You can use Pandas to efficiently process and analyze your data, making it a valuable tool in the field of data analysis.

<a id = "16"></a><br>
<h1 style="background-color:#77BFC7;font-family:courier;font-size:250%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> Creating Series </h1>

You can create a Pandas Series in Python using the pd.Series() constructor. The Series is a one-dimensional labeled array that can hold data of any data type. It is similar to a NumPy array but comes with additional features, such as labeled indexing. To create a Pandas Series, you need to import the Pandas library first using import pandas as pd, and then you can use the pd.Series() constructor to create the Series.

In [91]:
import pandas as pd

# Create a Series from a Python list:

data = [10, 20, 30, 40, 50]
series1 = pd.Series(data)

series1

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [92]:
# Create a Series from a NumPy array:

data_np = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
series2 = pd.Series(data_np)

series2

0    0.1
1    0.2
2    0.3
3    0.4
4    0.5
dtype: float64

In [93]:
# Create a Series from a dictionary (the keys become the index labels):

data_dict = {'A': 10, 'B': 20, 'C': 30, 'D': 40, 'E': 50}
series3 = pd.Series(data_dict)

series3

A    10
B    20
C    30
D    40
E    50
dtype: int64
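
In the cell above the labels come from the dictionary keys. Labels can also be attached to a plain list through the `index` parameter of `pd.Series()`. A minimal sketch, where `series4` is a hypothetical name continuing the numbering used above:

```python
import pandas as pd

# Explicit labels passed through the `index` parameter
series4 = pd.Series([10, 20, 30], index=['A', 'B', 'C'])

print(series4['B'])             # access a value by its label
print(series4.index.tolist())   # the labels as a plain list
```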

<div style="border-radius:10px; border:#DEB887 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">

<h3 align="left"><font color='#DEB887'>💡 Notes: </font></h3>

* Some commonly used commands for Pandas Series:

**pd.Series():** Used to create a new Pandas Series.

**head():** Used to view the first few rows of the Series.

**tail():** Used to view the last few rows of the Series.

**shape:** Returns the dimensions of the Series as a one-element tuple, for example (5,).

**size:** Returns the total number of elements in the Series.

**index:** Returns the index labels of the Series.

**values:** Returns the values (data) of the Series.

**describe():** Used to view basic statistical information about the Series (count, mean, std, min, 25%, 50%, 75%, max).

**unique():** Returns the unique values in the Series.

**nunique():** Returns the number of unique values in the Series.

**count():** Returns the total number of elements in the Series (excluding NaN values).

**sum():** Returns the sum of the values in the Series.

**mean():** Returns the mean (average) of the values in the Series.

**median():** Returns the median value of the values in the Series.

**min():** Returns the minimum value in the Series.

**max():** Returns the maximum value in the Series.

**std():** Returns the standard deviation of the values in the Series.

**var():** Returns the variance of the values in the Series.

**sort_values():** Sorts the Series by values.

**sort_index():** Sorts the Series by index labels.

**isnull():** Checks if each element is NaN and returns the results as a boolean Series.

**fillna():** Used to fill NaN values with another value.

**dropna():** Used to remove the elements that are NaN.

**apply():** Used to apply a function to each element of the Series.

**map():** Used to map each element to a new value using a dictionary, another Series, or a function.

**replace():** Used to replace specific values with another value.

**iloc[]:** Used to access elements in the Series by positional index.

**loc[]:** Used to access elements in the Series by label index.

**astype():** Used to change the data type of the Series.

**copy():** Used to create a copy of the Series when you want to make changes without affecting the original Series.
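
A minimal sketch exercising a few of the commands listed above on a made-up Series that contains one missing value. All names and values here are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative Series with one missing value
s = pd.Series([3.0, 1.0, np.nan, 3.0])

print(s.shape)            # one-element tuple: (4,)
print(s.size)             # counts every element, including NaN
print(s.count())          # counts only the non-NaN elements
print(s.isnull().sum())   # number of missing values
print(s.nunique())        # number of distinct non-NaN values

filled = s.fillna(0)      # returns a new Series; `s` itself is unchanged
dropped = s.dropna()      # returns a new Series without the NaN entry

# map() with a dictionary translates each value to a label
labels = filled.map({0.0: 'zero', 1.0: 'one', 3.0: 'three'})
print(labels.tolist())
```

Note the difference between `size` and `count()`: they only agree when the Series has no missing values.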

In [94]:
series1.index

RangeIndex(start=0, stop=5, step=1)

In [95]:
series1.dtype

dtype('int64')

In [96]:
series1.size

5

In [97]:
series1.ndim

1

In [98]:
series1.values

array([10, 20, 30, 40, 50])

In [99]:
type(series1.values)

numpy.ndarray

In [100]:
series1.head(2)

0    10
1    20
dtype: int64

In [101]:
series1.tail(2)

3    40
4    50
dtype: int64

<a id = "17"></a><br>
<h1 style="background-color:#77BFC7;font-family:courier;font-size:250%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> Creating DataFrame </h1>

Here are some examples of creating a DataFrame in pandas:

In [102]:
# Create lists containing data:

names = ['John', 'Jane', 'Mike']
ages = [25, 30, 22]
cities = ['New York', 'London', 'Paris']

In [103]:
# Create a DataFrame from lists:

df = pd.DataFrame({'Name': names, 'Age': ages, 'City': cities})

df

Unnamed: 0,Name,Age,City
0,John,25,New York
1,Jane,30,London
2,Mike,22,Paris


In [104]:
# Create a dictionary containing data:

data = {
    'Name': ['John', 'Jane', 'Mike'],
    'Age': [25, 30, 22],
    'City': ['New York', 'London', 'Paris']
}

In [105]:
# Create a DataFrame from dictionary:

df = pd.DataFrame(data)

df

Unnamed: 0,Name,Age,City
0,John,25,New York
1,Jane,30,London
2,Mike,22,Paris


In [106]:
# Create a dictionary containing data:

data = {
    'Name': ['John', 'Jane', 'Mike'],
    'Age': [25, 30, 22],
    'City': ['New York', 'London', 'Paris']
}

In [107]:
# Create a DataFrame from dictionary and set index:

df = pd.DataFrame(data, index=['person1', 'person2', 'person3'])

df

Unnamed: 0,Name,Age,City
person1,John,25,New York
person2,Jane,30,London
person3,Mike,22,Paris


In [108]:
# Create a DataFrame by reading from a CSV file:

df = pd.read_csv('/kaggle/input/titanic/train.csv')

df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [109]:
# Create a DataFrame from an Excel file:

# df = pd.read_excel('veri.xlsx')
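
The file paths in the two cells above only resolve where those files exist. To try `read_csv` without any file on disk, one option is to feed it an in-memory text buffer. This is a sketch under that assumption; the CSV content and the names `csv_text` and `df_csv` are made up for illustration.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a file on disk
csv_text = """Name,Age,City
John,25,New York
Jane,30,London
Mike,22,Paris
"""

# read_csv accepts a file path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv.shape)
print(df_csv.dtypes)
```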

<a id = "18"></a><br>
<h1 style="background-color:#77BFC7;font-family:courier;font-size:250%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> DataFrame Operations </h1>

In [110]:
import pandas as pd
import seaborn as sns

df = pd.read_csv('/kaggle/input/titanic/train.csv')



* **Quick Look at Data:**

In [111]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [112]:
df.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [113]:
df.shape

(891, 12)

In [114]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [115]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [116]:
df.index

RangeIndex(start=0, stop=891, step=1)

In [117]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
PassengerId,891.0,446.0,257.353842,1.0,223.5,446.0,668.5,891.0
Survived,891.0,0.383838,0.486592,0.0,0.0,0.0,1.0,1.0
Pclass,891.0,2.308642,0.836071,1.0,2.0,3.0,3.0,3.0
Age,714.0,29.699118,14.526497,0.42,20.125,28.0,38.0,80.0
SibSp,891.0,0.523008,1.102743,0.0,0.0,0.0,1.0,8.0
Parch,891.0,0.381594,0.806057,0.0,0.0,0.0,0.0,6.0
Fare,891.0,32.204208,49.693429,0.0,7.9104,14.4542,31.0,512.3292


In [118]:
df.isnull().values.any()

True

In [119]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
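
Raw counts are easier to judge as a share of all rows. Because `isnull()` returns booleans, taking their mean gives the fraction of missing values per column. The sketch below uses a small made-up frame (`df_missing` is a hypothetical name) rather than the Titanic data, so it runs on its own.

```python
import numpy as np
import pandas as pd

# Small made-up frame with some missing entries
df_missing = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.93, 8.05],
})

# The mean of a boolean column is the fraction of True values
missing_pct = (df_missing.isnull().mean() * 100).sort_values(ascending=False)

print(missing_pct)
```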

In [120]:
# Accessing a specific column:

df['Name']

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [121]:
# Selecting specific columns:

df[['Survived', 'Pclass']]

Unnamed: 0,Survived,Pclass
0,0,3
1,1,1
2,1,3
3,1,1
4,0,3
...,...,...
886,0,2
887,1,1
888,0,3
889,1,1


In [122]:
# Accessing a specific row with label index:

df.loc[0]

PassengerId                          1
Survived                             0
Pclass                               3
Name           Braund, Mr. Owen Harris
Sex                               male
Age                               22.0
SibSp                                1
Parch                                0
Ticket                       A/5 21171
Fare                              7.25
Cabin                              NaN
Embarked                             S
Name: 0, dtype: object

In [123]:
# Accessing a specific row by integer position:

df.iloc[0]

PassengerId                          1
Survived                             0
Pclass                               3
Name           Braund, Mr. Owen Harris
Sex                               male
Age                               22.0
SibSp                                1
Parch                                0
Ticket                       A/5 21171
Fare                              7.25
Cabin                              NaN
Embarked                             S
Name: 0, dtype: object

* **Modifying Data:**

In [124]:
# Modifying column values:

df['Age'] = df['Age'] + 1

* **Data Filtering:**

In [125]:
# Filtering data based on conditions:

df[df['Age'] > 25].head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,39.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,27.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,36.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,36.0,0,0,373450,8.05,,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,55.0,0,0,17463,51.8625,E46,S


* **Handling NaN Values:**

In [126]:
# Dropping rows with NaN values:

df.dropna().head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,39.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,36.0,1,0,113803,53.1,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,55.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,5.0,1,1,PP 9549,16.7,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,59.0,0,0,113783,26.55,C103,S


In [127]:
# Filling NaN values with another value:

df.fillna(0).head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,23.0,1,0,A/5 21171,7.25,0,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,39.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,27.0,0,0,STON/O2. 3101282,7.925,0,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,36.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,36.0,0,0,373450,8.05,0,S
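
Filling every column with the same value, as above, puts a 0 into text columns such as Cabin. `fillna` also accepts a dictionary that maps column names to replacement values, so each column can get a fitting default. A minimal sketch on made-up data; the median and the 'Unknown' placeholder are illustrative choices, not a recommendation from the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up frame with one numeric and one text column
df_fill = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Cabin': [np.nan, 'C85', np.nan],
})

# A dict maps each column to its own replacement value
filled = df_fill.fillna({
    'Age': df_fill['Age'].median(),   # numeric column: use the median
    'Cabin': 'Unknown',               # text column: use a placeholder
})

print(filled)
```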


* **Data Grouping and Statistical Operations:**

In [128]:
grouped_df = df.groupby('Ticket')   # group the rows by ticket number

grouped_df.head()   # first 5 rows of each group, not of the whole DataFrame

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,23.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,39.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,27.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,36.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,36.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,28.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,20.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,27.0,0,0,111369,30.0000,C148,C


In [129]:
# Calculating the mean age for each group (here, each ticket):

mean_age_by_ticket = grouped_df['Age'].mean()
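
The cell above stores the grouped means without displaying them. The same pattern is easier to follow on a tiny frame where the result can be checked by hand. The data and the names `df_group` and `mean_age_by_class` are made up for this sketch.

```python
import pandas as pd

# Made-up data: two passengers in class 1, one in class 2, two in class 3
df_group = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3],
    'Age': [38.0, 35.0, 27.0, 22.0, 26.0],
})

# One mean per distinct value of the grouping column
mean_age_by_class = df_group.groupby('Pclass')['Age'].mean()

print(mean_age_by_class)
```

The result is a Series whose index holds the group keys (here 1, 2, and 3).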

* **Data Sorting:**

In [130]:
# Sorting data by a column:

sorted_df = df.sort_values('Age')

sorted_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
803,804,1,3,"Thomas, Master. Assad Alexander",male,1.42,0,1,2625,8.5167,,C
755,756,1,2,"Hamalainen, Master. Viljo",male,1.67,1,1,250649,14.5,,S
644,645,1,3,"Baclini, Miss. Eugenie",female,1.75,2,1,2666,19.2583,,C
469,470,1,3,"Baclini, Miss. Helene Barbara",female,1.75,2,1,2666,19.2583,,C
78,79,1,2,"Caldwell, Master. Alden Gates",male,1.83,0,2,248738,29.0,,S


<div style="border-radius:10px; border:#DEB887 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">

<h3 align="left"><font color='#DEB887'>💡 Notes: </font></h3>

* The axis parameter in Pandas DataFrame is used to specify which axis to perform an operation on.

    * **axis=0**: The operation runs down the rows (along the index), so each column is processed separately and you get one result per column. This is the default for most DataFrame operations.
    * **axis=1**: The operation runs across the columns, so each row is processed separately and you get one result per row.
    * For methods such as drop(), axis=0 means the given labels are row labels, and axis=1 means they are column labels.
    
* The inplace parameter in Pandas DataFrame is used to determine whether the operation will modify the current DataFrame or not. By default, in most DataFrame operations, **inplace=False**, which means the operation returns the result without modifying the original DataFrame.

* However, when **inplace=True**, the operation is applied directly to the current DataFrame, and the DataFrame is modified. In this case, the operation returns None.

In [131]:
import pandas as pd

# Let's create a DataFrame:

data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6]
}

df = pd.DataFrame(data)

df

Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6


In [132]:
# Usage of axis=0 (summing down the rows, one total per column):

data = df.sum(axis=0)

data

A     6
B    15
dtype: int64

In [133]:
# Usage of axis=1 (summing across the columns, one total per row):

data = df.sum(axis=1)

data

0    5
1    7
2    9
dtype: int64
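
The `inplace` behaviour described in the notes can be sketched the same way. The example below uses `drop()` on a copy of the small frame from above; `df_demo`, `without_b`, and `result` are illustrative names.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# inplace=False (default): a new DataFrame is returned, the original is untouched
without_b = df_demo.drop('B', axis=1)
print(df_demo.columns.tolist())   # still has both columns

# inplace=True: the original is modified and the call returns None
result = df_demo.drop('B', axis=1, inplace=True)
print(result)
print(df_demo.columns.tolist())
```

A common mistake is writing `df = df.drop(..., inplace=True)`, which replaces the DataFrame with `None`.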

* **Selection in Pandas:**

In [134]:
import pandas as pd
import seaborn as sns

df = pd.read_csv('/kaggle/input/titanic/train.csv')

In [135]:
df[0:8]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


In [136]:
df.drop(0, axis=0).head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q


In [137]:
delete_indexes = [1, 3, 5, 7]

df.drop(delete_indexes, axis=0).head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S
12,13,0,3,"Saundercock, Mr. William Henry",male,20.0,0,0,A/5. 2151,8.05,,S
13,14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.275,,S


* **Converting Variable to Index:**

In [138]:
df["Age"].head()

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

In [139]:
df.index = df["Age"]

In [140]:
df.drop("Age", axis=1).head()

Age,PassengerId,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare,Cabin,Embarked
22.0,1,0,3,"Braund, Mr. Owen Harris",male,1,0,A/5 21171,7.25,,S
38.0,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,1,0,PC 17599,71.2833,C85,C
26.0,3,1,3,"Heikkinen, Miss. Laina",female,0,0,STON/O2. 3101282,7.925,,S
35.0,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,1,0,113803,53.1,C123,S
35.0,5,0,3,"Allen, Mr. William Henry",male,0,0,373450,8.05,,S


In [141]:
df.drop("Age", axis=1, inplace=True)

df.head()

Age,PassengerId,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare,Cabin,Embarked
22.0,1,0,3,"Braund, Mr. Owen Harris",male,1,0,A/5 21171,7.25,,S
38.0,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,1,0,PC 17599,71.2833,C85,C
26.0,3,1,3,"Heikkinen, Miss. Laina",female,0,0,STON/O2. 3101282,7.925,,S
35.0,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,1,0,113803,53.1,C123,S
35.0,5,0,3,"Allen, Mr. William Henry",male,0,0,373450,8.05,,S


* **Convert index to variable:**

In [142]:
df.index

Float64Index([22.0, 38.0, 26.0, 35.0, 35.0,  nan, 54.0,  2.0, 27.0, 14.0,
              ...
              33.0, 22.0, 28.0, 25.0, 39.0, 27.0, 19.0,  nan, 26.0, 32.0],
             dtype='float64', name='Age', length=891)

In [143]:
df["Age"] = df.index

In [144]:
df.head()

Age,PassengerId,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age
22.0,1,0,3,"Braund, Mr. Owen Harris",male,1,0,A/5 21171,7.25,,S,22.0
38.0,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,1,0,PC 17599,71.2833,C85,C,38.0
26.0,3,1,3,"Heikkinen, Miss. Laina",female,0,0,STON/O2. 3101282,7.925,,S,26.0
35.0,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,1,0,113803,53.1,C123,S,35.0
35.0,5,0,3,"Allen, Mr. William Henry",male,0,0,373450,8.05,,S,35.0


In [145]:
df.drop("Age", axis=1, inplace=True)

In [146]:
df.reset_index().head()

Unnamed: 0,Age,PassengerId,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,22.0,1,0,3,"Braund, Mr. Owen Harris",male,1,0,A/5 21171,7.25,,S
1,38.0,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,1,0,PC 17599,71.2833,C85,C
2,26.0,3,1,3,"Heikkinen, Miss. Laina",female,0,0,STON/O2. 3101282,7.925,,S
3,35.0,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,1,0,113803,53.1,C123,S
4,35.0,5,0,3,"Allen, Mr. William Henry",male,0,0,373450,8.05,,S


In [147]:
df = df.reset_index()
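
The steps above assign the column to the index and drop it by hand. pandas also offers `set_index()`, which does both in one call, with `reset_index()` as its inverse. A minimal sketch on made-up data; `df_idx`, `indexed`, and `restored` are hypothetical names.

```python
import pandas as pd

df_idx = pd.DataFrame({
    'Name': ['John', 'Jane', 'Mike'],
    'Age': [25, 30, 22],
})

# Move the 'Age' column into the index (it is removed from the columns)
indexed = df_idx.set_index('Age')
print(indexed.columns.tolist())

# Move the index back into a regular column
restored = indexed.reset_index()
print(restored.columns.tolist())
```

As in the cells above, the restored column ends up first, because `reset_index()` inserts it at the front.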

* **Operations on Variables:**

In [148]:
"Age" in df

True

In [149]:
df["Age"].head()

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

In [150]:
df.Age.head()

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

In [151]:
type(df["Age"].head())

pandas.core.series.Series

In [152]:
df[["Age"]].head()

Unnamed: 0,Age
0,22.0
1,38.0
2,26.0
3,35.0
4,35.0


In [153]:
type(df[["Age"]].head())

pandas.core.frame.DataFrame

In [154]:
df["Age2"] = df["Age"]**2

In [155]:
df["Age3"] = df["Age"] / df["Age2"]

In [156]:
df.head()

Unnamed: 0,Age,PassengerId,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age2,Age3
0,22.0,1,0,3,"Braund, Mr. Owen Harris",male,1,0,A/5 21171,7.25,,S,484.0,0.045455
1,38.0,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,1,0,PC 17599,71.2833,C85,C,1444.0,0.026316
2,26.0,3,1,3,"Heikkinen, Miss. Laina",female,0,0,STON/O2. 3101282,7.925,,S,676.0,0.038462
3,35.0,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,1,0,113803,53.1,C123,S,1225.0,0.028571
4,35.0,5,0,3,"Allen, Mr. William Henry",male,0,0,373450,8.05,,S,1225.0,0.028571


In [157]:
df = df.drop(["Age3","Age2"], axis=1)

In [158]:
df.head()

Unnamed: 0,Age,PassengerId,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,22.0,1,0,3,"Braund, Mr. Owen Harris",male,1,0,A/5 21171,7.25,,S
1,38.0,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,1,0,PC 17599,71.2833,C85,C
2,26.0,3,1,3,"Heikkinen, Miss. Laina",female,0,0,STON/O2. 3101282,7.925,,S
3,35.0,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,1,0,113803,53.1,C123,S
4,35.0,5,0,3,"Allen, Mr. William Henry",male,0,0,373450,8.05,,S


In [159]:
df.loc[:, ~df.columns.str.contains("Age")].head() # Selects the columns whose names do not contain "Age"; ~ negates the boolean mask.

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,0,0,373450,8.05,,S


<div style="border-radius:10px; border:#DEB887 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">

<h3 align="left"><font color='#DEB887'>💡 Notes: </font></h3>
    
Explanation of how this code works:

1. `df` represents a pandas DataFrame object. We want to perform operations on this DataFrame.

2. The `.loc[]` method allows you to access specific rows and columns in the DataFrame. Here, we want to filter only the columns, so we select all rows using ":".

3. The expression `~df.columns.str.contains("Age")` checks the column names in the DataFrame and aims to select columns that do not contain the word "Age."

   - `df.columns` represents an Index object containing column names.
   - The `.str.contains("Age")` method is used to check if column names contain the word "Age" and returns a boolean (True/False) value for each column.
   - The `~` symbol negates these boolean values, so columns that don't contain the word "Age" are marked as `True`, and columns containing "Age" are marked as `False`.

4. As a result, the `.loc[:, ~df.columns.str.contains("Age")]` expression selects all rows and the columns that do not contain the word "Age."

5. The `.head()` method returns the first few rows (by default, the first 5 rows) of this filtered DataFrame.
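
The steps above can be sketched on a tiny frame where the intermediate boolean mask is easy to inspect. The column names and values are made up, and `df_cols`, `contains_age`, and `without_age` are illustrative names.

```python
import pandas as pd

df_cols = pd.DataFrame({
    'Age': [22, 38],
    'Age2': [484, 1444],
    'Fare': [7.25, 71.28],
})

# One boolean per column name: does it contain "Age"?
contains_age = df_cols.columns.str.contains('Age')
print(contains_age)

# ~ flips the booleans, so only the non-matching columns are kept
without_age = df_cols.loc[:, ~contains_age]
print(without_age.columns.tolist())
```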

<a id = "19"></a><br>
<h1 style="background-color:#77BFC7;font-family:courier;font-size:250%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> loc - iloc </h1>

First, let's talk about indexes.

In pandas, an index is the set of labels that identify the rows of a DataFrame. It provides a way to access and reference data in a more meaningful and efficient manner. Index labels are usually unique, but pandas does not require this. By default, when you create a DataFrame, pandas assigns a numerical index starting from 0 to the rows. However, you can also specify a column or multiple columns from the DataFrame to use as the index.

Some key points about DataFrame indexes in pandas:

* **Default Index:** When you create a DataFrame without specifying an index, pandas automatically assigns a default numerical index starting from 0 and incrementing by 1 for each row.

* **Specifying Index:** You can supply the row labels explicitly during DataFrame creation using the index parameter, which takes a list (or array) of labels with one entry per row. For example: df = pd.DataFrame(data, index=my_index_labels).

* **Setting Index Later:** You can also set the index later using the set_index() method of the DataFrame. For example: df.set_index('column_name', inplace=True).

* **Multi-level Index:** A DataFrame can have a multi-level index, where you can use multiple columns as the index, creating a hierarchical structure.

* **Accessing Rows with Index:** Once you set an index, you can use it to access rows in a DataFrame using loc[] method. For example: df.loc['index_value'].

* **Resetting Index:** If you want to remove the current index and revert to the default numerical index, you can use the reset_index() method.
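
A few of these points can be sketched on a small made-up frame: setting an index from a column, looking rows up with `loc[]`, and building a multi-level index from two columns. All names and values below are illustrative.

```python
import pandas as pd

df_people = pd.DataFrame({
    'City': ['London', 'Paris', 'Paris'],
    'Name': ['Jane', 'Anna', 'Mike'],
    'Age': [30, 31, 22],
})

# Single-level index: labels are allowed to repeat
by_city = df_people.set_index('City')
print(by_city.index.is_unique)   # 'Paris' appears twice
print(by_city.loc['Paris'])      # returns both Paris rows

# Multi-level index built from two columns
by_city_name = df_people.set_index(['City', 'Name'])
print(by_city_name.loc[('Paris', 'Anna'), 'Age'])
```

With a multi-level index, a full key is written as a tuple with one entry per level.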

---

In [160]:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}

# Creating a DataFrame with custom row labels as the index:

df = pd.DataFrame(data, index=['A', 'B', 'C', 'D'])

df

Unnamed: 0,Name,Age,City
A,Alice,25,New York
B,Bob,30,San Francisco
C,Charlie,35,Los Angeles
D,David,28,Chicago


<div style="border-radius:10px; border:#DEB887 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">

<h3 align="left"><font color='#DEB887'>💡 Notes: </font></h3>

**1. DataFrame loc[] method:**

* The loc[] method is used to access rows and columns by their label names.
* It allows you to perform selection based on the labels of the rows and columns.
* loc[] is used for label-based operations.
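* Depending on the selection, the result is a scalar (one row label and one column label), a Series (a single row or a single column), or a DataFrame (several rows and columns).
* Label slices such as 0:3 include the end label.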
* The result of the operation is a subset of rows and columns containing the specified labels.

**2. DataFrame iloc[] method:**

* The iloc[] method is used to access rows and columns by their integer-based positions.
* It allows you to perform selection based on the integer index of the rows and columns.
* iloc[] is used for integer-based operations.
* Depending on the selection, the result is a scalar, a Series, or a DataFrame, following the same rules as loc[].
* Position slices such as 0:3 exclude the end position, as in regular Python slicing.
* The result of the operation is a subset of rows and columns containing the specified integer indexes.
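
The difference between the two is easiest to see when the row labels are not integers. The sketch below uses a made-up frame with string labels to show how slices and return types behave; `df_demo` and the other names are hypothetical.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'Name': ['John', 'Jane', 'Mike', 'Emily'], 'Age': [25, 30, 22, 28]},
    index=['a', 'b', 'c', 'd'],
)

# Label slices include the end label; position slices exclude the end position
by_label = df_demo.loc['a':'c']    # rows a, b, c
by_position = df_demo.iloc[0:2]    # rows at positions 0 and 1

# The return type depends on what is selected
one_row = df_demo.loc['b']         # a single row comes back as a Series
one_value = df_demo.iloc[1, 1]     # a single cell comes back as a scalar

print(by_label.index.tolist())
print(by_position.index.tolist())
print(type(one_row).__name__, one_value)
```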

---

In [161]:
import pandas as pd

# Let's create a DataFrame:

data = {
    'Name': ['John', 'Jane', 'Mike', 'Emily'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'London', 'Paris', 'Sydney']
}

df = pd.DataFrame(data)

# loc[] for label-based operation:

df.loc[0:3,"Age"]    # Retrieves the 'Age' values for row labels 0 through 3 (the end label is included)

0    25
1    30
2    22
3    28
Name: Age, dtype: int64

In [162]:
# iloc[] for integer-based operation:

df.iloc[0:3]   # Retrieves the rows at positions 0, 1 and 2 (the end position is excluded)

Unnamed: 0,Name,Age,City
0,John,25,New York
1,Jane,30,London
2,Mike,22,Paris


---

In [163]:
import pandas as pd
import seaborn as sns

df = pd.read_csv('/kaggle/input/titanic/train.csv')

In [164]:
# iloc: integer based selection

df.iloc[0:3]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [165]:
df.iloc[0, 0]

1

In [166]:
# loc: label based selection

df.loc[0:3]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S


In [167]:
df.iloc[0:3, 0:3]

Unnamed: 0,PassengerId,Survived,Pclass
0,1,0,3
1,2,1,1
2,3,1,3


In [168]:
df.loc[0:3, "Age"]

0    22.0
1    38.0
2    26.0
3    35.0
Name: Age, dtype: float64

<a id = "20"></a><br>
<h1 style="background-color:#77BFC7;font-family:courier;font-size:250%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> Conditional Element Operations </h1>

1. **Conditional Selection:**

Conditional selection is used to select rows or columns in a DataFrame that satisfy a specific condition.

2. **Conditional Filtering:**

Conditional filtering is used to filter rows or columns in a DataFrame based on a specific condition.

3. **Reassignment:**

Reassignment is used to update the values of elements that satisfy a specific condition in a DataFrame.

4. **Handling NaN Values:**

Handling NaN (Not a Number) values is used to replace or remove NaN values in a DataFrame.

---

In [169]:
import pandas as pd

# Let's create a DataFrame:

data = {
    'Name': ['John', 'Jane', 'Mike', 'Emily'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'London', 'Paris', 'Sydney']
}

df = pd.DataFrame(data)

# Select people older than 25:

df[df['Age'] > 25]

Unnamed: 0,Name,Age,City
1,Jane,30,London
3,Emily,28,Sydney


In [170]:
# Filter people living in "London":

df[df['City'] == 'London']

Unnamed: 0,Name,Age,City
1,Jane,30,London


In [171]:
# Update the age of "Jane" to 35:

df.loc[df['Name'] == 'Jane', 'Age'] = 35

df

Unnamed: 0,Name,Age,City
0,John,25,New York
1,Jane,35,London
2,Mike,22,Paris
3,Emily,28,Sydney


In [172]:
# Replace NaN values with "Unknown" (this DataFrame has no NaN values, so nothing changes):

df.fillna('Unknown', inplace=True)

In [173]:
# Drop rows containing NaN values:

df.dropna(inplace=True)
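
Because the small DataFrame above contains no missing values, the two calls leave it unchanged. The sketch below uses a hypothetical frame named `people` that does contain NaN values, to show what `isna()`, `fillna()` and `dropna()` do. The column values and the choice of filling `Age` with the column mean are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Age and one missing City.
people = pd.DataFrame({
    "Name": ["John", "Jane", "Mike"],
    "Age": [25.0, np.nan, 22.0],
    "City": ["New York", None, "Paris"],
})

# Count the missing values in each column.
missing_per_column = people.isna().sum()

# Fill each column with its own replacement value (returns a new DataFrame).
filled = people.fillna({"Age": people["Age"].mean(), "City": "Unknown"})

# Drop every row that has at least one missing value (returns a new DataFrame).
dropped = people.dropna()

print(missing_per_column)
print(filled)
print(dropped)
```

Both methods return a new object unless `inplace=True` is passed, so `people` itself still holds its NaN values afterwards.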

---

In [174]:
import pandas as pd
import seaborn as sns

df = pd.read_csv('/kaggle/input/titanic/train.csv')

In [175]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [176]:
df.loc[df["Age"] > 50, ["Age", "Pclass"]].head()

Unnamed: 0,Age,Pclass
6,54.0,1
11,58.0,1
15,55.0,2
33,66.0,2
54,65.0,1


In [177]:
df.loc[(df["Age"] > 50) & (df["Sex"] == "male"), ["Age", "Pclass"]].head()

Unnamed: 0,Age,Pclass
6,54.0,1
33,66.0,2
54,65.0,1
94,59.0,3
96,71.0,1


In [178]:
df.loc[(df["Age"] > 50) & (df["Sex"] == "male") & ((df["Embarked"] == "C") | (df["Embarked"] == "S")), ["Age", "Pclass", "Embarked"]].head()

Unnamed: 0,Age,Pclass,Embarked
6,54.0,1,S
33,66.0,2,S
54,65.0,1,C
94,59.0,3,S
96,71.0,1,C
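
When one column is compared against several allowed values, `isin()` can replace a chain of `|` comparisons. The sketch below uses a small hypothetical frame (`people`) rather than the Titanic file so that it runs on its own. Each comparison is wrapped in parentheses because `&` and `|` bind more tightly than `>` and `==` in Python.

```python
import pandas as pd

# Hypothetical passengers, used only to illustrate combined conditions.
people = pd.DataFrame({
    "Age": [54, 66, 30, 71, 58],
    "Sex": ["male", "male", "male", "male", "female"],
    "Embarked": ["S", "Q", "C", "C", "S"],
})

# Males older than 50 who embarked at C or S.
mask = (people["Age"] > 50) & (people["Sex"] == "male") & people["Embarked"].isin(["C", "S"])
selected = people.loc[mask, ["Age", "Embarked"]]

print(selected)
```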


<a id = "21"></a><br>
<h1 style="background-color:#77BFC7;font-family:courier;font-size:250%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> Merging </h1>

1. **concat():** The concat() function concatenates multiple DataFrames along a specified axis: row-wise (axis=0, the default) or column-wise (axis=1).

2. **merge():** The merge() method combines two DataFrames based on the values of specific columns. It is similar to the "join" operation in SQL; when no columns are specified, it uses the columns the two DataFrames have in common.

3. **join():** The join() method combines two DataFrames based on their indices. Unlike merge(), which matches on columns by default, join() matches on the index by default.

---

In [179]:
# Let's create DataFrames:

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2'],
                    'C': ['C0', 'C1', 'C2']},
                   index=[0, 1, 2])

df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
                    'B': ['B3', 'B4', 'B5'],
                    'C': ['C3', 'C4', 'C5']},
                   index=[3, 4, 5])

In [180]:
df1

Unnamed: 0,A,B,C
0,A0,B0,C0
1,A1,B1,C1
2,A2,B2,C2


In [181]:
df2

Unnamed: 0,A,B,C
3,A3,B3,C3
4,A4,B4,C4
5,A5,B5,C5


In [182]:
# Concatenate DataFrames row-wise:

result_concat_row = pd.concat([df1, df2])

result_concat_row

Unnamed: 0,A,B,C
0,A0,B0,C0
1,A1,B1,C1
2,A2,B2,C2
3,A3,B3,C3
4,A4,B4,C4
5,A5,B5,C5


In [183]:
# Concatenate DataFrames column-wise:

result_concat_column = pd.concat([df1, df2], axis=1)

result_concat_column

Unnamed: 0,A,B,C,A.1,B.1,C.1
0,A0,B0,C0,,,
1,A1,B1,C1,,,
2,A2,B2,C2,,,
3,,,,A3,B3,C3
4,,,,A4,B4,C4
5,,,,A5,B5,C5
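
The NaN values in the column-wise result appear because `concat()` aligns rows by index label, and the two frames share no labels. As a minimal sketch (the frames `left` and `right` are hypothetical), resetting the index of one frame before concatenating places the rows side by side by position:

```python
import pandas as pd

# Two hypothetical frames whose index labels do not overlap.
left = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
right = pd.DataFrame({"B": ["B2", "B3"]}, index=[2, 3])

# Aligned by label: four rows, padded with NaN.
misaligned = pd.concat([left, right], axis=1)

# Aligned by position: drop the old labels of `right` first.
aligned = pd.concat([left, right.reset_index(drop=True)], axis=1)

print(misaligned)
print(aligned)
```

For row-wise concatenation, passing `ignore_index=True` discards the original labels in a similar way and renumbers the result from 0.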


----

In [184]:
# Let's create DataFrames:

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2'],
                    'key': ['K0', 'K1', 'K2']})

df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2'],
                    'D': ['D0', 'D1', 'D2'],
                    'key': ['K0', 'K1', 'K2']})

In [185]:
df1

Unnamed: 0,A,B,key
0,A0,B0,K0
1,A1,B1,K1
2,A2,B2,K2


In [186]:
df2

Unnamed: 0,C,D,key
0,C0,D0,K0
1,C1,D1,K1
2,C2,D2,K2


In [187]:
# Merge DataFrames based on the 'key' column:

result_merge = pd.merge(df1, df2, on='key')

result_merge

Unnamed: 0,A,B,key,C,D
0,A0,B0,K0,C0,D0
1,A1,B1,K1,C1,D1
2,A2,B2,K2,C2,D2


---

In [188]:
# Let's create DataFrames:

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']},
                   index=['K0', 'K1', 'K2'])

df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2'],
                    'D': ['D0', 'D1', 'D2']},
                   index=['K0', 'K1', 'K3'])

In [189]:
# Join DataFrames based on their indices:

result_join = df1.join(df2, how='outer')

result_join

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,C1,D1
K2,A2,B2,,
K3,,,C2,D2


---

In [190]:
df1 = pd.DataFrame({'employees': ['john', 'dennis', 'mark', 'maria'],
                    'group': ['accounting', 'engineering', 'engineering', 'hr']})

df2 = pd.DataFrame({'employees': ['mark', 'john', 'dennis', 'maria'],
                    'start_date': [2010, 2009, 2014, 2019]})

In [191]:
df1

Unnamed: 0,employees,group
0,john,accounting
1,dennis,engineering
2,mark,engineering
3,maria,hr


In [192]:
df2

Unnamed: 0,employees,start_date
0,mark,2010
1,john,2009
2,dennis,2014
3,maria,2019


In [193]:
pd.merge(df1, df2, on="employees")

Unnamed: 0,employees,group,start_date
0,john,accounting,2009
1,dennis,engineering,2014
2,mark,engineering,2010
3,maria,hr,2019


In [194]:
df3 = pd.merge(df1, df2)

In [195]:
df4 = pd.DataFrame({'group': ['accounting', 'engineering', 'hr'],
                    'manager': ['Caner', 'Mustafa', 'Berkcan']})

In [196]:
pd.merge(df3, df4)

Unnamed: 0,employees,group,start_date,manager
0,john,accounting,2009,Caner
1,dennis,engineering,2014,Mustafa
2,mark,engineering,2010,Mustafa
3,maria,hr,2019,Berkcan
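
All the merges above find a match for every key. When some keys exist in only one DataFrame, the `how` parameter decides which rows survive, and `indicator=True` adds a `_merge` column recording where each row came from. The sketch below uses two hypothetical frames (`staff` and `managers`) in which the groups `sales` and `hr` are deliberately unmatched:

```python
import pandas as pd

# Hypothetical data: "sales" has no manager, "hr" has no employee.
staff = pd.DataFrame({"employees": ["john", "dennis", "mark"],
                      "group": ["accounting", "engineering", "sales"]})
managers = pd.DataFrame({"group": ["accounting", "engineering", "hr"],
                         "manager": ["Caner", "Mustafa", "Berkcan"]})

# Default how="inner": only keys present in both frames are kept.
inner = pd.merge(staff, managers, on="group")

# how="outer": keep all keys; _merge marks both / left_only / right_only.
outer = pd.merge(staff, managers, on="group", how="outer", indicator=True)

print(inner)
print(outer)
```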


<a id = "22"></a><br>
<h1 style="background-color:#77BFC7;font-family:courier;font-size:250%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> Aggregation & Grouping </h1>

**1. Grouping Operations**

Pandas DataFrame grouping operations divide the data into groups based on one or more columns and perform operations on each group independently. Grouping allows you to analyze and summarize data within each group separately.

You can use the **groupby()** method of the DataFrame to group the data based on one or more columns. The **groupby()** method returns a DataFrameGroupBy object that represents the grouped data.

Once you have the grouped object, you can perform various aggregation or transformation operations on it. Aggregation functions calculate summary statistics for each group, while transformation functions perform group-wise operations and return a result with the same index as the original data.
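
The difference between aggregation and transformation can be sketched as follows. The frame `sales` is a hypothetical stand-in with the same layout as the example data used in this section: `sum()` returns one value per group, whereas `transform("mean")` returns one value per original row.

```python
import pandas as pd

# Hypothetical data with two categories.
sales = pd.DataFrame({"Category": ["A", "A", "B", "B", "A", "B"],
                      "Value": [10, 20, 15, 25, 5, 30]})

# Aggregation: one row per group.
group_sum = sales.groupby("Category")["Value"].sum()

# Transformation: the group mean is broadcast back to every original row.
sales["group_mean"] = sales.groupby("Category")["Value"].transform("mean")

print(group_sum)
print(sales)
```

Because the transformed result keeps the original index, it can be assigned directly as a new column.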

**2. Aggregation Functions**

DataFrame aggregation functions in pandas are used to compute summary statistics on the data in each group after performing grouping operations. These functions aggregate the data into a single value for each group, allowing you to quickly analyze and summarize data based on specific categories or groups.

<div style="border-radius:10px; border:#DEB887 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">

<h3 align="left"><font color='#DEB887'>💡 Notes: </font></h3>
    
* **sum():** Calculates the sum of values in each group.
* **mean():** Calculates the mean (average) of values in each group.
* **median():** Calculates the median of values in each group.
* **min():** Calculates the minimum value in each group.
* **max():** Calculates the maximum value in each group.
* **count():** Counts the number of occurrences in each group.
* **first():** Returns the first element of each group.
* **last():** Returns the last element of each group.
* **var():** Computes the variance of each group.
* **std():** Computes the standard deviation of each group.

In [197]:
import pandas as pd

data = {
    'Category': ['A', 'A', 'B', 'B', 'A', 'B'],
    'Value': [10, 20, 15, 25, 5, 30]
}

df = pd.DataFrame(data)

In [198]:
df.groupby('Category')['Value'].sum()

Category
A    35
B    70
Name: Value, dtype: int64

In [199]:
df.groupby('Category')['Value'].mean()

Category
A    11.666667
B    23.333333
Name: Value, dtype: float64

In [200]:
df.groupby('Category')['Value'].median()

Category
A    10.0
B    25.0
Name: Value, dtype: float64

In [201]:
df.groupby('Category')['Value'].min()

Category
A     5
B    15
Name: Value, dtype: int64

In [202]:
df.groupby('Category')['Value'].max()

Category
A    20
B    30
Name: Value, dtype: int64

In [203]:
df.groupby('Category')['Value'].count()

Category
A    3
B    3
Name: Value, dtype: int64

In [204]:
df.groupby('Category')['Value'].first()

Category
A    10
B    15
Name: Value, dtype: int64

In [205]:
df.groupby('Category')['Value'].last()

Category
A     5
B    30
Name: Value, dtype: int64

In [206]:
df.groupby('Category')['Value'].var()

Category
A    58.333333
B    58.333333
Name: Value, dtype: float64

In [207]:
df.groupby('Category')['Value'].std()

Category
A    7.637626
B    7.637626
Name: Value, dtype: float64

---

<div style="border-radius:10px; border:#DEB887 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">

<h3 align="left"><font color='#DEB887'>💡 Notes: </font></h3>

* **aggregate():** The aggregate() function in pandas is used for performing aggregation operations on the data in a DataFrame. It allows you to apply multiple aggregation functions to different columns of the DataFrame at the same time. You can pass a dictionary of column names and corresponding aggregation functions to the aggregate() function.

* **filter():** On a grouped object, filter() keeps or drops whole groups. It takes a function that is called on each group and returns True or False, and the result contains only the rows of the groups for which the function returned True. (DataFrame.filter(), by contrast, selects rows or columns by label.)

* **apply():** The apply() function is used to apply a function along the axis of a DataFrame. It allows you to apply a custom function to each row or column of the DataFrame.

---

In [208]:
import pandas as pd

data = {
    'Category': ['A', 'A', 'B', 'B', 'A', 'B'],
    'Value': [10, 20, 15, 25, 5, 30]
}

df = pd.DataFrame(data)

In [209]:
# Apply multiple aggregation functions to the 'Value' column:

df.groupby('Category').agg({'Value': ['sum', 'mean', 'max']})

Unnamed: 0_level_0,Value,Value,Value
Unnamed: 0_level_1,sum,mean,max
Category,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
A,35,11.666667,20
B,70,23.333333,30


In [210]:
# Keep only the groups whose total 'Value' is greater than 40:

df.groupby('Category').filter(lambda x: x['Value'].sum() > 40)

Unnamed: 0,Category,Value
2,B,15
3,B,25
5,B,30
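
Passing a dictionary of lists to `agg()` produces two-level column names, as in the `sum`/`mean`/`max` output above. As an alternative sketch, pandas also supports named aggregation, where each keyword argument names an output column and maps it to a `(column, function)` pair. The names `df_demo`, `total`, `average` and `largest` below are chosen only for illustration.

```python
import pandas as pd

# Hypothetical copy of the example data used in this section.
df_demo = pd.DataFrame({"Category": ["A", "A", "B", "B", "A", "B"],
                        "Value": [10, 20, 15, 25, 5, 30]})

# Named aggregation: flat, self-describing column names.
summary = df_demo.groupby("Category").agg(
    total=("Value", "sum"),
    average=("Value", "mean"),
    largest=("Value", "max"),
)

print(summary)
```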


---

In [211]:
import pandas as pd
import seaborn as sns

df = pd.read_csv('/kaggle/input/titanic/train.csv')

df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [212]:
df["Age"].mean()

29.69911764705882

In [213]:
df.groupby("Sex")["Age"].mean()

Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64

In [214]:
df.groupby("Sex").agg({"Age": ["mean", "sum"]})

Unnamed: 0_level_0,Age,Age
Unnamed: 0_level_1,mean,sum
Sex,Unnamed: 1_level_2,Unnamed: 2_level_2
female,27.915709,7286.0
male,30.726645,13919.17


In [215]:
df.groupby("Sex").agg({"Age": ["mean", "sum"],"Survived": "mean"})

Unnamed: 0_level_0,Age,Age,Survived
Unnamed: 0_level_1,mean,sum,mean
Sex,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
female,27.915709,7286.0,0.742038
male,30.726645,13919.17,0.188908


In [216]:
df.groupby(["Sex", "Embarked"]).agg({"Age": ["mean"],"Survived": "mean"})

Unnamed: 0_level_0,Unnamed: 1_level_0,Age,Survived
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,mean
Sex,Embarked,Unnamed: 2_level_2,Unnamed: 3_level_2
female,C,28.344262,0.876712
female,Q,24.291667,0.75
female,S,27.771505,0.689655
male,C,32.998841,0.305263
male,Q,30.9375,0.073171
male,S,30.29144,0.174603


In [217]:
df.groupby(["Sex", "Embarked", "Pclass"]).agg({"Age": ["mean"],"Survived": "mean","Sex": "count"})

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Age,Survived,Sex
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,mean,mean,count
Sex,Embarked,Pclass,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
female,C,1,36.052632,0.976744,43
female,C,2,19.142857,1.0,7
female,C,3,14.0625,0.652174,23
female,Q,1,33.0,1.0,1
female,Q,2,30.0,1.0,2
female,Q,3,22.85,0.727273,33
female,S,1,32.704545,0.958333,48
female,S,2,29.719697,0.910448,67
female,S,3,23.223684,0.375,88
male,C,1,40.111111,0.404762,42


---

In [218]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [219]:
df["Age2"] = df["Age"]*2
df["Age3"] = df["Age"]*5

In [220]:
for col in df.columns:
    if "Age" in col:
        print(col)

for col in df.columns:
    if "Age" in col:
        print((df[col]/10).head())

for col in df.columns:
    if "Age" in col:
        df[col] = df[col]/10

Age
Age2
Age3
0    2.2
1    3.8
2    2.6
3    3.5
4    3.5
Name: Age, dtype: float64
0    4.4
1    7.6
2    5.2
3    7.0
4    7.0
Name: Age2, dtype: float64
0    11.0
1    19.0
2    13.0
3    17.5
4    17.5
Name: Age3, dtype: float64


In [221]:
df[["Age", "Age2", "Age3"]].apply(lambda x: x/10).head()

Unnamed: 0,Age,Age2,Age3
0,0.22,0.44,1.1
1,0.38,0.76,1.9
2,0.26,0.52,1.3
3,0.35,0.7,1.75
4,0.35,0.7,1.75


In [222]:
df.loc[:, df.columns.str.contains("Age")].apply(lambda x: x/10).head()

Unnamed: 0,Age,Age2,Age3
0,0.22,0.44,1.1
1,0.38,0.76,1.9
2,0.26,0.52,1.3
3,0.35,0.7,1.75
4,0.35,0.7,1.75


In [223]:
df.loc[:, df.columns.str.contains("Age")].apply(lambda x: (x - x.mean()) / x.std()).head()

Unnamed: 0,Age,Age2,Age3
0,-0.530005,-0.530005,-0.530005
1,0.57143,0.57143,0.57143
2,-0.254646,-0.254646,-0.254646
3,0.364911,0.364911,0.364911
4,0.364911,0.364911,0.364911


In [224]:
def standard_scaler(col):
    # Standardize a column: subtract its mean and divide by its standard deviation.
    return (col - col.mean()) / col.std()

df.loc[:, df.columns.str.contains("Age")].apply(standard_scaler).head()

Unnamed: 0,Age,Age2,Age3
0,-0.530005,-0.530005,-0.530005
1,0.57143,0.57143,0.57143
2,-0.254646,-0.254646,-0.254646
3,0.364911,0.364911,0.364911
4,0.364911,0.364911,0.364911


---

In [225]:
data = sns.load_dataset("tips")

data.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [226]:
data.groupby('time').agg({"total_bill" : ['sum', 'min', 'max', 'mean']})

Unnamed: 0_level_0,total_bill,total_bill,total_bill,total_bill
Unnamed: 0_level_1,sum,min,max,mean
time,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Lunch,1167.47,7.51,43.11,17.168676
Dinner,3660.3,3.07,50.81,20.797159


In [227]:
data.groupby(["day","time"]).agg({"total_bill" :['sum', 'min', 'max', 'mean']})

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,total_bill,total_bill,total_bill
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,min,max,mean
day,time,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Thur,Lunch,1077.55,7.51,43.11,17.664754
Thur,Dinner,18.78,18.78,18.78,18.78
Fri,Lunch,89.92,8.58,16.27,12.845714
Fri,Dinner,235.96,5.75,40.17,19.663333
Sat,Lunch,0.0,,,
Sat,Dinner,1778.4,3.07,50.81,20.441379
Sun,Lunch,0.0,,,
Sun,Dinner,1627.16,7.25,48.17,21.41


In [228]:
data1 = data[(data['time'] == 'Lunch') & (data['sex'] == 'Female')]

data1.groupby(["day"]).agg({"total_bill" : ['sum', 'min', 'max', 'mean'], "tip" : ['sum', 'min', 'max', 'mean']})

Unnamed: 0_level_0,total_bill,total_bill,total_bill,total_bill,tip,tip,tip,tip
Unnamed: 0_level_1,sum,min,max,mean,sum,min,max,mean
day,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
Thur,516.11,8.35,43.11,16.64871,79.42,1.25,5.17,2.561935
Fri,55.76,10.09,16.27,13.94,10.98,2.0,3.48,2.745
Sat,0.0,,,,0.0,,,
Sun,0.0,,,,0.0,,,


In [229]:
data[(data["size"] < 3) & (data["total_bill"] > 10)].agg({"total_bill" : ["mean"]})

Unnamed: 0,total_bill
mean,17.184965


In [230]:
data["total_bill_tip_sum"] = data['total_bill'] + data['tip']

In [231]:
data.sort_values(by = "total_bill_tip_sum",ascending=False, ignore_index=True).head(10)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,total_bill_tip_sum
0,50.81,10.0,Male,Yes,Sat,Dinner,3,60.81
1,48.33,9.0,Male,No,Sat,Dinner,4,57.33
2,48.27,6.73,Male,No,Sat,Dinner,4,55.0
3,48.17,5.0,Male,No,Sun,Dinner,6,53.17
4,45.35,3.5,Male,Yes,Sun,Dinner,3,48.85
5,43.11,5.0,Female,Yes,Thur,Lunch,4,48.11
6,39.42,7.58,Male,No,Sat,Dinner,4,47.0
7,44.3,2.5,Female,Yes,Sat,Dinner,3,46.8
8,41.19,5.0,Male,No,Thur,Lunch,5,46.19
9,40.17,4.73,Male,Yes,Fri,Dinner,4,44.9


<a id = "23"></a><br>
<h1 style="background-color:#77BFC7;font-family:courier;font-size:250%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> Pivot Table </h1>

In pandas, a pivot table is a data manipulation tool that summarizes and reshapes data by applying aggregation operations to it. It takes a DataFrame as input and returns a new DataFrame in which the values of the chosen index and column fields become the row and column labels, and each cell holds the aggregated value for that combination. The **pivot_table()** function in pandas is used to create pivot tables.

In [232]:
# pd.pivot_table(
#     data,               # The DataFrame to use for creating the pivot table
#     values=None,        # The column to aggregate. Can be a list for multiple columns.
#     index=None,         # The column(s) to use as index (rows) of the pivot table.
#     columns=None,       # The column(s) to use as columns of the pivot table.
#     aggfunc='mean',     # The aggregation function(s) to apply to the values. Default is 'mean'.
#     fill_value=None,    # The value to replace missing values. Default is None.
#     margins=False,      # If True, add a row and a column containing the grand total. Default is False.
#     margins_name='All'  # The name of the row/column that contains the grand total. Default is 'All'.
# )

<div style="border-radius:10px; border:#DEB887 solid; padding: 15px; background-color: #FFFAF0; font-size:100%; text-align:left">

<h3 align="left"><font color='#DEB887'>💡 Notes: </font></h3>

* **data:** The DataFrame that contains the data to be used for creating the pivot table.
* **values:** The column(s) for which you want to calculate the summary statistics. It can be a single column or a list of columns for multiple values.
* **index:** The column(s) to use as the index (rows) of the pivot table.
* **columns:** The column(s) to use as the columns of the pivot table.
* **aggfunc:** The aggregation function(s) to apply to the values. By default, it is set to 'mean', but you can use other functions such as 'sum', 'count', 'min', 'max', or even custom functions.
* **fill_value:** The value to replace missing or NaN values in the resulting pivot table.
* **margins:** If True, it adds a row and a column containing the grand total. The default is False.
* **margins_name:** The name of the row/column that contains the grand total. The default is 'All'.

In [233]:
# Let's say we have a DataFrame containing sales data:

data = {
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing', 'Electronics', 'Clothing'],
    'City': ['New York', 'New York', 'Los Angeles', 'Los Angeles', 'Chicago', 'Chicago'],
    'Sales': [1000, 500, 800, 300, 600, 200]
}

df = pd.DataFrame(data)

In [234]:
# We can create a pivot table to summarize the total sales for each category and city:

pivot_table = pd.pivot_table(df, values='Sales', index='Category', columns='City', aggfunc='sum')

pivot_table

City,Chicago,Los Angeles,New York
Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Clothing,200,300,500
Electronics,600,800,1000
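
The `fill_value`, `margins` and `margins_name` parameters described in the notes can be sketched on a similar table. The frame `sales` below is hypothetical and deliberately has no Clothing sale in Chicago, so that one cell of the pivot table would otherwise be empty. The margin label `Total` is an arbitrary choice.

```python
import pandas as pd

# Hypothetical sales data with one missing Category/City combination.
sales = pd.DataFrame({
    "Category": ["Electronics", "Clothing", "Electronics", "Clothing", "Electronics"],
    "City": ["New York", "New York", "Los Angeles", "Los Angeles", "Chicago"],
    "Sales": [1000, 500, 800, 300, 600],
})

table = pd.pivot_table(
    sales,
    values="Sales",
    index="Category",
    columns="City",
    aggfunc="sum",
    fill_value=0,          # show 0 instead of NaN for the empty cell
    margins=True,          # add a summary row and column
    margins_name="Total",  # label used for the summary row and column
)

print(table)
```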


---

In [235]:
import pandas as pd
import seaborn as sns

df = pd.read_csv('/kaggle/input/titanic/train.csv')

df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [236]:
df.pivot_table("Survived", "Sex", "Embarked")

Embarked,C,Q,S
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
female,0.876712,0.75,0.689655
male,0.305263,0.073171,0.174603


In [237]:
df.pivot_table("Survived", "Sex", ["Embarked", "Pclass"])

Embarked,C,C,C,Q,Q,Q,S,S,S
Pclass,1,2,3,1,2,3,1,2,3
Sex,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
female,0.976744,1.0,0.652174,1.0,1.0,0.727273,0.958333,0.910448,0.375
male,0.404762,0.2,0.232558,0.0,0.0,0.076923,0.35443,0.154639,0.128302


In [238]:
df["new_age"] = pd.cut(df["Age"], [0, 10, 18, 25, 40, 90])

In [239]:
df.pivot_table("Survived", "Sex", ["new_age", "Pclass"])

new_age,"(0, 10]","(0, 10]","(0, 10]","(10, 18]","(10, 18]","(10, 18]","(18, 25]","(18, 25]","(18, 25]","(25, 40]","(25, 40]","(25, 40]","(40, 90]","(40, 90]","(40, 90]"
Pclass,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3
Sex,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2
female,0.0,1.0,0.5,1.0,1.0,0.52381,0.941176,0.933333,0.5,1.0,0.90625,0.464286,0.961538,0.846154,0.111111
male,1.0,1.0,0.363636,0.666667,0.0,0.103448,0.333333,0.047619,0.115385,0.513514,0.071429,0.172043,0.28,0.095238,0.064516
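
The `pd.cut()` call used for `new_age` assigns each value to an interval that is open on the left and closed on the right, which is why the column labels read `(0, 10]`, `(10, 18]` and so on. A minimal sketch with hypothetical ages follows. The label names are made up for illustration.

```python
import pandas as pd

# Hypothetical ages placed on and around the bin edges.
ages = pd.Series([5, 10, 18, 19, 40, 41])

# Without labels, each value is mapped to its interval.
intervals = pd.cut(ages, [0, 10, 18, 25, 40, 90])

# With labels, each interval gets a readable name instead.
labelled = pd.cut(ages, [0, 10, 18, 25, 40, 90],
                  labels=["child", "teen", "young", "adult", "senior"])

print(intervals)
print(labelled)
```

A value equal to a bin edge, such as 10 or 18, falls into the interval that ends at that edge.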


<a id = "24"></a><br>
<h1 style="background-color:#77BFC7;font-family:courier;font-size:250%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> Reading Data </h1>

The **read_csv()** function in pandas is used to read data from a CSV (Comma-Separated Values) file and create a DataFrame. It is one of the most commonly used functions in pandas for importing data from external sources. Here is the syntax of the read_csv() function with its common arguments:

In [240]:
# pd.read_csv(
#     filepath_or_buffer, # The path or URL to the CSV file or a file-like object containing the CSV data.
#     sep=',',            # The delimiter used to separate values in the CSV file. Default is ','.
#     header='infer',     # The row number(s) to use as column names. Default is 'infer'.
#     names=None,         # A list of column names to use. If not specified, header row(s) will be used as column names.
#     index_col=None,     # The column(s) to use as the index (row labels) of the DataFrame. Default is None.
#     usecols=None,       # A list of column names to read from the file. Default is None (read all columns).
#     dtype=None,         # Data type to force for specified columns.
#     parse_dates=False,  # List of column names to parse as dates. Default is False (no date parsing).
#     na_values=None,     # Additional strings to recognize as NaN (Not a Number) values.
#     skiprows=None,      # Number of rows to skip at the beginning of the file.
#     nrows=None,         # Number of rows to read from the file.
#     skip_blank_lines=True,  # If True, skip over blank lines rather than interpreting them as NaN values.
#     encoding='utf-8',   # File encoding to use (e.g., 'utf-8', 'latin-1', etc.). Default is 'utf-8'.
#     thousands=None,     # Character to use as thousands separator when parsing numeric data.
#     decimal='.',        # Character to use as the decimal point when parsing numeric data.
#     # (Pass header=None instead of header='infer' when the file has no header row.)
#     comment=None,       # Character(s) to use as a comment indicator. Lines starting with this character will be ignored.
#     skipfooter=0,       # Number of lines to skip at the end of the file.
#     engine='c',         # Parser engine to use. The 'c' engine is faster, while the 'python' engine is more feature-rich.
# )
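
Several of these arguments can be tried without a file on disk by reading from an in-memory buffer. The sketch below is only an illustration: the semicolon-separated text, its column names and the `missing` marker are all made up.

```python
import io
import pandas as pd

# Hypothetical CSV text: ';' as delimiter, ',' as decimal point.
raw = io.StringIO(
    "id;name;price;sold_on;note\n"
    "1;pen;1,50;2023-01-05;x\n"
    "2;book;missing;2023-01-06;y\n"
    "3;bag;20,00;2023-01-07;z\n"
)

items = pd.read_csv(
    raw,
    sep=";",                                      # field delimiter
    decimal=",",                                  # decimal point used in the file
    usecols=["id", "name", "price", "sold_on"],   # skip the 'note' column
    index_col="id",                               # use 'id' as the row labels
    na_values=["missing"],                        # treat this string as NaN
    parse_dates=["sold_on"],                      # convert to datetime
)

print(items)
print(items.dtypes)
```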

<a id = "25"></a><br>
<h1 style="background-color:#77BFC7;font-family:courier;font-size:250%;font-style: oblique;font-weight: bold;font-variant: small-caps;text-align:center;border-radius: 15px 50px;"> Rule Based Classification </h1>

In [241]:
df = pd.read_csv("/kaggle/input/persona/week_02_data.csv")

* **First look**

In [242]:
df.head()

Unnamed: 0,PRICE,SOURCE,SEX,COUNTRY,AGE
0,39,android,male,bra,17
1,39,android,male,bra,17
2,49,android,male,bra,17
3,29,android,male,tur,17
4,49,android,male,tur,17


In [243]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   PRICE    5000 non-null   int64 
 1   SOURCE   5000 non-null   object
 2   SEX      5000 non-null   object
 3   COUNTRY  5000 non-null   object
 4   AGE      5000 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 234.4+ KB


In [244]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
PRICE,5000.0,34.132,12.464897,9.0,29.0,39.0,39.0,59.0
AGE,5000.0,23.5814,8.995908,15.0,17.0,21.0,27.0,66.0


In [245]:
df.isnull().sum()

PRICE      0
SOURCE     0
SEX        0
COUNTRY    0
AGE        0
dtype: int64

In [246]:
df.describe(include=[object]).T

Unnamed: 0,count,unique,top,freq
SOURCE,5000,2,android,2974
SEX,5000,2,female,2621
COUNTRY,5000,6,usa,2065


* **How many unique SOURCE values are there, and how often does each occur?**

In [247]:
df["SOURCE"].value_counts()

android    2974
ios        2026
Name: SOURCE, dtype: int64

In [248]:
df["SOURCE"].mode()

0    android
Name: SOURCE, dtype: object

* **How many unique PRICE values are in the data?**

In [249]:
df["PRICE"].nunique()

6

* **How many sales were made at each PRICE?**

In [250]:
df["PRICE"].value_counts()

29    1305
39    1260
49    1031
19     992
59     212
9      200
Name: PRICE, dtype: int64

* **How many sales were made from which country?**

In [251]:
df["COUNTRY"].value_counts()

usa    2065
bra    1496
deu     455
tur     451
fra     303
can     230
Name: COUNTRY, dtype: int64

* **How much was earned in total from sales in each country?**

In [252]:
df.groupby("COUNTRY").agg({"PRICE": "sum"})

Unnamed: 0_level_0,PRICE
COUNTRY,Unnamed: 1_level_1
bra,51354
can,7730
deu,15485
fra,10177
tur,15689
usa,70225


* **What are the sales counts by SOURCE type?**

In [253]:
df.groupby("SOURCE")["SOURCE"].agg([("Sales_Amount","count")])

Unnamed: 0_level_0,Sales_Amount
SOURCE,Unnamed: 1_level_1
android,2974
ios,2026


* **What are the PRICE averages by country?**

In [254]:
df.groupby("COUNTRY")["PRICE"].agg([("Price_Averages","mean")])

Unnamed: 0_level_0,Price_Averages
COUNTRY,Unnamed: 1_level_1
bra,34.32754
can,33.608696
deu,34.032967
fra,33.587459
tur,34.78714
usa,34.007264


* **What are the PRICE averages by SOURCE?**

In [255]:
df.groupby("SOURCE").agg({"PRICE": "mean"})

Unnamed: 0_level_0,PRICE
SOURCE,Unnamed: 1_level_1
android,34.174849
ios,34.069102


* **What are the PRICE averages in the COUNTRY-SOURCE breakdown?**

In [256]:
df.groupby(["COUNTRY","SOURCE"]).agg({"PRICE": "mean"})

Unnamed: 0_level_0,Unnamed: 1_level_0,PRICE
COUNTRY,SOURCE,Unnamed: 2_level_1
bra,android,34.387029
bra,ios,34.222222
can,android,33.330709
can,ios,33.951456
deu,android,33.869888
deu,ios,34.268817
fra,android,34.3125
fra,ios,32.776224
tur,android,36.229437
tur,ios,33.272727


* **What are the average earnings in the COUNTRY, SOURCE, SEX, AGE breakdown?**

In [257]:
df.groupby(["COUNTRY","SOURCE","SEX","AGE"]).agg({"PRICE": "mean"})

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,PRICE
COUNTRY,SOURCE,SEX,AGE,Unnamed: 4_level_1
bra,android,female,15,38.714286
bra,android,female,16,35.944444
bra,android,female,17,35.666667
bra,android,female,18,32.255814
bra,android,female,19,35.206897
...,...,...,...,...
usa,ios,male,42,30.250000
usa,ios,male,50,39.000000
usa,ios,male,53,34.000000
usa,ios,male,55,29.000000


In [258]:
[df.groupby(["COUNTRY","SOURCE","SEX","AGE"])[["PRICE"]].mean().apply(lambda x : x.loc[[i]].head(5)) for i in set(df.COUNTRY)]

[                                PRICE
 COUNTRY SOURCE  SEX    AGE           
 can     android female 15   25.666667
                        16   29.689655
                        18   37.333333
                        19   32.529412
                        20   31.500000,
                                 PRICE
 COUNTRY SOURCE  SEX    AGE           
 fra     android female 15   37.571429
                        16   36.727273
                        17   34.652174
                        19   27.333333
                        21   39.000000,
                                 PRICE
 COUNTRY SOURCE  SEX    AGE           
 bra     android female 15   38.714286
                        16   35.944444
                        17   35.666667
                        18   32.255814
                        19   35.206897,
                                 PRICE
 COUNTRY SOURCE  SEX    AGE           
 deu     android female 15   32.000000
                        16   27.000000
                      

* **Sort the output by PRICE.**

In [259]:
agg_df = df.groupby(["COUNTRY","SOURCE","SEX","AGE"]).agg({"PRICE": "mean"}).sort_values(by="PRICE",ascending=False)

* **Convert the index names to variable names.**

In [260]:
agg_df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,PRICE
COUNTRY,SOURCE,SEX,AGE,Unnamed: 4_level_1
bra,android,male,46,59.0
usa,android,male,36,59.0
fra,android,female,24,59.0
usa,ios,male,32,54.0
deu,android,female,36,49.0


In [261]:
agg_df.reset_index(inplace=True)

In [262]:
agg_df.head()

Unnamed: 0,COUNTRY,SOURCE,SEX,AGE,PRICE
0,bra,android,male,46,59.0
1,usa,android,male,36,59.0
2,fra,android,female,24,59.0
3,usa,ios,male,32,54.0
4,deu,android,female,36,49.0


* **Convert the AGE variable to a categorical variable and add it to agg_df.**

In [263]:
agg_df["AGE_CAT"] = pd.cut(agg_df["AGE"],bins=[0,18,23,30,40,agg_df["AGE"].max()],
                                        labels=['0_18', '19_23', '24_30', '31_40', f'41_{agg_df["AGE"].max()}'])

In [264]:
agg_df["AGE_CAT"].value_counts()

24_30    78
0_18     77
19_23    71
31_40    65
41_66    57
Name: AGE_CAT, dtype: int64

* **Define new level-based customers and add them to the data set as a variable.**

In [265]:
agg_df["customers_level_based"] = [f"{COUNTRY}_{SOURCE}_{SEX}_{AGE_CAT}".upper()
                                 for COUNTRY,SOURCE,SEX,AGE_CAT
                                 in zip(agg_df["COUNTRY"],agg_df["SOURCE"],agg_df["SEX"],agg_df["AGE_CAT"])]

In [266]:
agg_df = agg_df.groupby("customers_level_based").agg({"PRICE":"mean"})

* **Segment the level-based customers (e.g. USA_ANDROID_MALE_0_18) by PRICE.**

In [267]:
agg_df["SEGMENT"] = pd.qcut(agg_df["PRICE"],q=4,labels=["D","C","B","A"])
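
`pd.qcut()` splits the values into bins that hold roughly equal numbers of observations, so with `q=4` each label corresponds to one quartile and `"A"` marks the highest prices. A minimal sketch with hypothetical prices:

```python
import pandas as pd

# Hypothetical average prices, already sorted for readability.
prices = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

# Four quantile-based bins: lowest quarter -> "D", highest quarter -> "A".
segments = pd.qcut(prices, q=4, labels=["D", "C", "B", "A"])

print(segments.tolist())
```

Unlike `pd.cut()`, where the bin edges are fixed in advance, `pd.qcut()` derives the edges from the data itself.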

* **Categorize new customers and estimate how much revenue they could bring in.**

In [268]:
reset_df = agg_df.reset_index()

In [269]:
new_customer = "TUR_ANDROID_FEMALE_31_40"

In [270]:
agg_df.loc[new_customer]

PRICE      41.833333
SEGMENT            A
Name: TUR_ANDROID_FEMALE_31_40, dtype: object

In [271]:
reset_df[reset_df["customers_level_based"]==new_customer][["PRICE","SEGMENT"]]

Unnamed: 0,PRICE,SEGMENT
72,41.833333,A


* **All in One Function**

In [272]:
def all_in_one(df, customer):
    # Average PRICE per COUNTRY-SOURCE-SEX-AGE combination, highest first.
    df = df.groupby(["COUNTRY","SOURCE","SEX","AGE"]).agg({"PRICE": "mean"}).sort_values(by="PRICE",ascending=False).reset_index()
    # Bin AGE into labelled categories.
    df["AGE_CAT"] = pd.cut(df["AGE"], bins=[0, 18, 23, 30, 40, df["AGE"].max()],
                                        labels=['0_18', '19_23', '24_30', '31_40', f'41_{df["AGE"].max()}'])
    # Build the persona name from COUNTRY, SOURCE, SEX and AGE_CAT (positions 0, 1, 2 and 5).
    df["customers_level_based"] = [f"{i[0]}_{i[1]}_{i[2]}_{i[5]}".upper() for i in df.values]
    # Average PRICE per persona, then split the personas into quartile segments.
    df = df.groupby("customers_level_based").agg({"PRICE": "mean"})
    df["SEGMENT"] = pd.qcut(df["PRICE"], q=4, labels=["D", "C", "B", "A"])
    df.reset_index(inplace=True)
    result = df[df["customers_level_based"] == customer][["PRICE", "SEGMENT"]]
    # Report the `customer` argument rather than the global `new_customer`.
    return f"{customer} Average Income: {round(result.iloc[0]['PRICE'], 2)}, Segment: {result.iloc[0]['SEGMENT']}"


new_customer = "TUR_ANDROID_FEMALE_31_40"
new_customer = "FRA_IOS_FEMALE_31_40"

all_in_one(df,new_customer)

'FRA_IOS_FEMALE_31_40 Average Income: 32.82, Segment: C'