# Week 2: NumPy & Pandas
### Fall ’25 Data Science Bootcamp
**Shaahid Ahmed Nadeem**

# NumPy vs Pandas

| **NumPy** | **Pandas** |
|----------|------------|
| When we work with numerical data, we prefer the NumPy module. | When we work with tabular data, we prefer the pandas module. |
| The core data structure of NumPy is the array (`ndarray`). | The core data structures of pandas are the DataFrame and the Series. |
| NumPy is memory efficient. | Pandas consumes more memory. |
| NumPy tends to perform better when the number of rows is 50K or fewer. | Pandas tends to perform better when the number of rows is 500K or more. |
| Indexing NumPy arrays is very fast. | Indexing a pandas Series is much slower than indexing a NumPy array. |
| NumPy provides multi-dimensional arrays. | Pandas provides a 2D table object called DataFrame. |
| It is used less often directly in industry applications. | It is used more often directly in industry applications. |


# The Basics of NumPy and Pandas

**NumPy** is the core library for scientific computing in Python. It provides a high-performance multidimensional **array object** and tools for working with these arrays. If you are already familiar with MATLAB, you might find this [tutorial](http://wiki.scipy.org/NumPy_for_Matlab_Users) useful to get started with NumPy.

### Why use Numpy Arrays instead of Lists?


In [None]:
l = [1,"Rohan", 1.23]

*   Data types: Arrays in NumPy are **homogeneous**, meaning all elements must be of the same data type (e.g., integers, floats), whereas lists can contain elements of different data types.

*   Memory efficiency: Arrays are more **memory efficient** compared to lists because they store data in a contiguous block of memory. Lists, on the other hand, store references to objects in memory, which can result in more memory overhead.

*   Performance: Operations on arrays are generally faster and more efficient than operations on lists, especially for large datasets, due to **NumPy's implementation in C** and optimized algorithms.

*   Functionality: NumPy arrays come with a wide range of **built-in functions and methods** for mathematical operations, linear algebra, statistical analysis, and more. Lists have a more limited set of built-in functions and methods.
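
The first two points can be seen in a minimal sketch like the one below. The variable names (`mixed`, `arr_mixed`, `arr_num`, `ints`) are made up for illustration, and the byte counts assume 64-bit integers.

```python
import numpy as np

# A list can hold mixed types; converting it forces one common dtype.
mixed = [1, "Rohan", 1.23]
arr_mixed = np.array(mixed)
print(arr_mixed.dtype.kind)   # 'U': every element was converted to a Unicode string

# Mixing ints and floats upcasts everything to float.
arr_num = np.array([1, 2, 3.5])
print(arr_num.dtype)          # float64

# Homogeneous storage means a fixed number of bytes per element.
ints = np.arange(1000, dtype=np.int64)
print(ints.itemsize, ints.nbytes)   # 8 bytes per element, 8000 bytes in total
```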




To use Numpy, we first need to import the numpy package. By convention, we import it using the alias np. Then, when we want to use modules or methods in this library, we preface them with np.



In [2]:
import numpy as np

### Arrays and array construction

A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.

We can create a `numpy` array by passing a Python list to `np.array()`.

In [3]:
a = np.array([1, 2, 3])  # Create a rank 1 array

This creates the array we can see on the right here:

![](http://jalammar.github.io/images/numpy/create-numpy-array-1.png)

In [6]:
print(type(a), a.shape, a[0], a[1], a[2])
a[0] = 5                 # Change an element of the array
print(a)

<class 'numpy.ndarray'> (3,) 1 2 3
[5 2 3]


To create a `numpy` array with more dimensions, we can pass nested lists, like this:

![](http://jalammar.github.io/images/numpy/numpy-array-create-2d.png)

![](http://jalammar.github.io/images/numpy/numpy-3d-array.png)

In [10]:
z = np.array([1,2,3])
print(z.shape)

(3,)


In [14]:
b = np.array([[1,2],[3,4]])   # Create a rank 2 array
print(b)
print(b.shape)

[[1 2]
 [3 4]]
(2, 2)


In [15]:
c=np.array([[[1,2],[3,4]],[[5,6],[7,8]]])
print(c)
print(c.shape)

[[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]
(2, 2, 2)


There are often cases when we want numpy to initialize the values of the array for us. numpy provides methods like `ones()`, `zeros()`, and `random.random()` for these cases. We just pass them the number of elements we want it to generate:

![](http://jalammar.github.io/images/numpy/create-numpy-array-ones-zeros-random.png)

Sometimes, we need an array of a specific shape with “placeholder” values that we plan to fill in with the result of a computation. The `zeros` or `ones` functions are handy for this:

In [19]:
a = np.ones(3)  # Create an array of all ones
print(a)
b=np.random.random(3)
print(b)

[1. 1. 1.]
[0.91512135 0.41354155 0.59036575]


We can also use these methods to produce multi-dimensional arrays, as long as we pass them a tuple describing the dimensions of the matrix we want to create:

![](http://jalammar.github.io/images/numpy/numpy-matrix-ones-zeros-random.png)



In [25]:
a = np.zeros((2,2))  # Create an array of all zeros
print(a)
b = np.ones((1,2))   # Create an array of all ones
print(b)
c = np.random.random((2,2)) # Create an array filled with random values
print(c)

[[0. 0.]
 [0. 0.]]
[[1. 1.]]
[[0.48918122 0.25830757]
 [0.37054373 0.43507359]]


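Recall that slicing a Python list `l` uses the form `l[start:end:step_size]`. The `arange` function described below takes its arguments in the same order.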

Numpy also has two useful functions for creating sequences of numbers: `arange` and `linspace`.

The `arange` function accepts three arguments, which define the start value, stop value of a half-open interval, and step size. (The default step size, if not explicitly specified, is 1; the default start value, if not explicitly specified, is 0.)

The `linspace` function is similar, but we can specify the number of values instead of the step size, and it will create a sequence of evenly spaced values.

In [33]:
f = np.arange(10,50,5)   # Create an array of values starting at 10 in increments of 5
print(f)

[10 15 20 25 30 35 40 45]


In [34]:
g = np.linspace(10, 45, num=8)
print(g)

[10. 15. 20. 25. 30. 35. 40. 45.]


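Note that the `arange` result ends on 45, not 50, because the stop value of the half-open interval is excluded. `linspace`, by contrast, includes both endpoints by default, which is why we passed 45 as its stop value to get the same sequence.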

In [35]:
g = np.linspace(0., 1., num=5)
print(g)

[0.   0.25 0.5  0.75 1.  ]


In [37]:
# Using linspace
linspace_array = np.linspace(0, 10, 5)
print("Linspace Array:", linspace_array)

# Using arange
arange_array = np.arange(0, 10, 2)
print("Arange Array:", arange_array)

Linspace Array: [ 0.   2.5  5.   7.5 10. ]
Arange Array: [0 2 4 6 8]


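Sometimes, we may want to construct an array from existing arrays by “stacking” the existing arrays, either vertically or horizontally. We can use `vstack()` and `hstack()`, respectively. (`row_stack` is an alias of `vstack` that is deprecated in recent NumPy releases; `column_stack` resembles `hstack` but turns 1-D arrays into the columns of a 2-D array.)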

In [42]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = np.vstack((a,b))
print(c)



[[1 2 3]
 [4 5 6]]


In [48]:
arr = np.array([1,2,3])
print(arr)
print(arr.shape)

[1 2 3]
(3,)


In [52]:
a = np.array([[7], [8], [9]])
print(a.shape)
b = np.array([[4], [5], [6]])
c = np.hstack((a,b))
print(c)
print(c.shape)

(3, 1)
[[7 4]
 [8 5]
 [9 6]]
(3, 2)
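
The difference between `hstack` and `column_stack` only shows up for one-dimensional inputs. A small sketch, with illustrative variable names `h` and `cs`:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

h = np.hstack((a, b))         # 1-D inputs are joined end to end
cs = np.column_stack((a, b))  # 1-D inputs become the columns of a 2-D array

print(h.shape)    # (6,)
print(cs.shape)   # (3, 2)
print(cs)
```

For inputs that are already two-dimensional columns, as in the cell above, the two functions give the same result.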


## NumPy Array Attributes

In [53]:
x1=np.array([5,6,7,8,9])
print(x1.shape)
print(x1)

(5,)
[5 6 7 8 9]


Each array has attributes ``ndim`` (the number of dimensions), ``shape`` (the size of each dimension), and ``size`` (the total size of the array):

In [56]:
a = np.array([[5],[4],[3],[2],[1]])
print(a.shape)
print(a.ndim)
print(a.size)

(5, 1)
2
5


In [58]:
print("x1 ndim: ", x1.ndim)
print("x1 shape:", x1.shape)

x2 = np.random.random((3, 4))  # Two-dimensional array
print("x2 shape:", x2.shape)
print("x2 size: ", x2.size)
print("x2 ndim: ", x2.ndim)


x1 ndim:  1
x1 shape: (5,)
x2 shape: (3, 4)
x2 size:  12
x2 ndim:  2


#### Datatypes

Every numpy array is a grid of elements of the same type. Numpy provides a large set of numeric datatypes that you can use to construct arrays. Numpy tries to guess a datatype when you create an array, but functions that construct arrays usually also include an optional argument to explicitly specify the datatype. Here is an example:

In [65]:
x = np.array([1, 2])  # Let numpy choose the datatype
y = np.array([1.0, 2.0])  # Let numpy choose the datatype
z = np.array([1, 2], dtype=np.int64)  # Force a particular datatype

print(x.dtype, y.dtype, z.dtype)

int64 float64 int64


You can read all about numpy datatypes in the [documentation](http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html).
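
As a small sketch of working with datatypes explicitly, the example below converts an existing array with `astype` and passes `dtype` to a constructor. The array names are illustrative only.

```python
import numpy as np

x = np.array([1.7, 2.2, -3.9])
as_int = x.astype(np.int64)     # conversion to int drops the fractional part
print(as_int)                   # [ 1  2 -3]

small = np.zeros(3, dtype=np.float32)   # dtype argument on a constructor
print(small.dtype, small.itemsize)      # float32 4
```

Note that `astype` returns a new array; `x` itself keeps its `float64` dtype.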

## Array Indexing: Accessing Single Elements

If you are familiar with Python's standard list indexing, indexing in NumPy will feel quite familiar.
In a one-dimensional array, the $i^{th}$ value (counting from zero) can be accessed by specifying the desired index in square brackets, just as with Python lists:

In [66]:
x1

array([5, 6, 7, 8, 9])

In [68]:
x1[0]

5

In [69]:
x1[4]

9

To index from the end of the array, you can use negative indices:

In [70]:
x1[-1]

9

In [71]:
x1[-2]

8

In a multi-dimensional array, items can be accessed using a comma-separated tuple of indices:

In [73]:
x2

array([[0.09150415, 0.83639033, 0.37243961, 0.12569436],
       [0.33241635, 0.36283559, 0.12105199, 0.83854716],
       [0.18915545, 0.99572522, 0.65118276, 0.49443012]])

In [75]:
x2[0, 0]

0.09150414512440053

In [76]:
x2[2, 0]

0.1891554464907399

In [77]:
x2[2, -1]

0.4944301234727836

Values can also be modified using any of the above index notation:

In [78]:
x2[0][0]=12

In [79]:
print(x2)

[[12.          0.83639033  0.37243961  0.12569436]
 [ 0.33241635  0.36283559  0.12105199  0.83854716]
 [ 0.18915545  0.99572522  0.65118276  0.49443012]]


In [80]:
x2[0, 0] = 12
x2

array([[12.        ,  0.83639033,  0.37243961,  0.12569436],
       [ 0.33241635,  0.36283559,  0.12105199,  0.83854716],
       [ 0.18915545,  0.99572522,  0.65118276,  0.49443012]])

Keep in mind that, unlike Python lists, NumPy arrays have a fixed type.
This means, for example, that if you attempt to insert a floating-point value to an integer array, the value will be silently truncated. Don't be caught unaware by this behavior!

In [82]:
print(x1)
x1[0] = 3.14159   # this will be truncated!
x1

[5 6 7 8 9]


array([3, 6, 7, 8, 9])

## Array Slicing: Accessing Subarrays

Just as we can use square brackets to access individual array elements, we can also use them to access subarrays with the *slice* notation, marked by the colon (``:``) character.
The NumPy slicing syntax follows that of the standard Python list; to access a slice of an array ``x``, use this:
``` python
x[start:stop:step]
```
If any of these are unspecified, they default to the values ``start=0``, ``stop=``*``size of dimension``*, ``step=1``.
We'll take a look at accessing sub-arrays in one dimension and in multiple dimensions.

### One-dimensional subarrays

In [83]:
l=[1,2,3,4,5]
print(l[:])

[1, 2, 3, 4, 5]


In [84]:
x = np.arange(20)
x

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [85]:
x[:5]  # first five elements

array([0, 1, 2, 3, 4])

In [86]:
x[5:]  # elements from index 5 onward

array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

In [87]:
x[4:7]  # middle sub-array

array([4, 5, 6])

In [88]:
x[::2]  # every other element  # What if I want odd numbers?

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

A potentially confusing case is when the ``step`` value is negative.
In this case, the defaults for ``start`` and ``stop`` are swapped.
This becomes a convenient way to reverse an array:

In [89]:
x[::-1]  # all elements, reversed

array([19, 18, 17, 16, 15, 14, 13, 12, 11, 10,  9,  8,  7,  6,  5,  4,  3,
        2,  1,  0])
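
Two more slices that combine a start value with a step may help here. This is a short sketch using the same `np.arange(20)` array; the names `odds` and `rev_from_5` are illustrative.

```python
import numpy as np

x = np.arange(20)

odds = x[1::2]          # start at index 1, take every second element
rev_from_5 = x[5::-2]   # start at index 5 and walk backwards in steps of 2

print(odds)         # [ 1  3  5  7  9 11 13 15 17 19]
print(rev_from_5)   # [5 3 1]
```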

### Multi-dimensional subarrays

Multi-dimensional slices work in the same way, with multiple slices separated by commas.
For example:

In [90]:
x2 = np.random.randint(10, size=(3, 4))  # Two-dimensional array
x2


array([[1, 0, 3, 8],
       [9, 5, 7, 7],
       [3, 7, 7, 1]])

In [91]:
x2[:2, :3]  # two rows, three columns
# x2[:2][:3]

array([[1, 0, 3],
       [9, 5, 7]])
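
The commented-out `x2[:2][:3]` is not equivalent to `x2[:2, :3]`. The sketch below uses a hypothetical 3x4 array built with `arange` so the result is reproducible.

```python
import numpy as np

x2 = np.arange(12).reshape(3, 4)   # hypothetical 3x4 array for illustration

tuple_slice = x2[:2, :3]   # first two rows, first three columns
chained = x2[:2][:3]       # first two rows, then up to three ROWS of that result

print(tuple_slice.shape)   # (2, 3)
print(chained.shape)       # (2, 4)
```

Each pair of brackets in the chained form slices along the first axis again, so the columns are never restricted.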

In [92]:
x2[:3, ::2]  # all rows, every other column

array([[1, 3],
       [9, 7],
       [3, 7]])

Finally, subarray dimensions can even be reversed together:

In [93]:
x2[::-1, ::-1]

array([[1, 7, 7, 3],
       [7, 7, 5, 9],
       [8, 3, 0, 1]])

#### Accessing array rows and columns

One commonly needed routine is accessing of single rows or columns of an array.
This can be done by combining indexing and slicing, using an empty slice marked by a single colon (``:``):

In [94]:
print(x2[:, 0])  # first column of x2

[1 9 3]


In [95]:
print(x2[0, :])  # first row of x2

[1 0 3 8]


In the case of row access, the empty slice can be omitted for a more compact syntax:

In [97]:
print(x2[0])  # equivalent to x2[0,:]

[1 0 3 8]


### Subarrays as no-copy views

One important–and extremely useful–thing to know about array slices is that they return *views* rather than *copies* of the array data.
This is one area in which NumPy array slicing differs from Python list slicing: in lists, slices will be copies.
Consider our two-dimensional array from before:

In [98]:
print(x2)

[[1 0 3 8]
 [9 5 7 7]
 [3 7 7 1]]


Let's extract a $2 \times 2$ subarray from this:

In [99]:
x2_sub = x2[:2, :2]
print(x2_sub)

[[1 0]
 [9 5]]


Now if we modify this subarray, we'll see that the original array is changed! Observe:

In [100]:
x2_sub[0, 0] = 99
print(x2_sub)

[[99  0]
 [ 9  5]]


In [101]:
print(x2)

[[99  0  3  8]
 [ 9  5  7  7]
 [ 3  7  7  1]]


This default behavior is actually quite useful: it means that when we work with large datasets, we can access and process pieces of these datasets without the need to copy the underlying data buffer.

### Creating copies of arrays

Despite the nice features of array views, it is sometimes useful to instead explicitly copy the data within an array or a subarray. This can be most easily done with the ``copy()`` method:

In [102]:
x2_sub_copy = x2[:2, :2].copy()
print(x2_sub_copy)

[[99  0]
 [ 9  5]]


If we now modify this subarray, the original array is not touched:

In [103]:
x2_sub_copy[0, 0] = 42
print(x2_sub_copy)

[[42  0]
 [ 9  5]]


In [104]:
print(x2)

[[99  0  3  8]
 [ 9  5  7  7]
 [ 3  7  7  1]]
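
If we are unsure whether two arrays share data, `np.shares_memory` can check. A minimal sketch on a hypothetical array (the names `view` and `copied` are illustrative):

```python
import numpy as np

x2 = np.arange(12).reshape(3, 4)   # hypothetical array for illustration
view = x2[:2, :2]                  # slicing returns a view
copied = x2[:2, :2].copy()         # an explicit copy owns its own data

print(np.shares_memory(x2, view))    # True
print(np.shares_memory(x2, copied))  # False
```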


## Reshaping of Arrays

Another useful type of operation is reshaping of arrays.
The most flexible way of doing this is with the ``reshape`` method.
For example, if you want to put the numbers 1 through 9 in a $3 \times 3$ grid, you can do the following:

In [None]:
grid = np.arange(1, 10).reshape((3,3))
print(grid)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [111]:
grid = np.arange(1,10)
print(grid)
grid = grid.reshape((3,3))
print(grid)

[1 2 3 4 5 6 7 8 9]
[[1 2 3]
 [4 5 6]
 [7 8 9]]


Note that for this to work, the size of the initial array must match the size of the reshaped array.
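
A short sketch of that size rule: passing `-1` for one dimension lets NumPy infer it from the total size, while a shape whose size does not match raises a `ValueError`. The array `values` is illustrative.

```python
import numpy as np

values = np.arange(12)

grid = values.reshape((3, -1))   # -1 lets NumPy infer the missing dimension
print(grid.shape)                # (3, 4)

try:
    values.reshape((5, 3))       # 15 slots for 12 elements: sizes do not match
except ValueError as err:
    print("reshape failed:", err)
```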

Another common reshaping pattern is the conversion of a one-dimensional array into a two-dimensional row or column matrix.
This can be done with the ``reshape`` method, or more easily done by making use of the ``newaxis`` keyword within a slice operation:

In [116]:
x = np.array([1, 2, 3])
# column vector via reshape
print(x.shape)
print(x)
x = x.reshape((3,1))
print(x.shape)
print(x)

(3,)
[1 2 3]
(3, 1)
[[1]
 [2]
 [3]]


For reference, the one-dimensional array has shape `(3,)`, while a row vector has shape `(1, 3)`.

In [120]:
x = np.array([1, 2, 3])
print(x.shape)
# column vector via newaxis
x = x[:, np.newaxis]   # adds an axis of length 1, increasing the array's dimensionality
print(x)
print(x.shape)


(3,)
[[1]
 [2]
 [3]]
(3, 1)


In [121]:
x = np.array([1, 2, 3])
# column vector via reshape
x.reshape((3, 1))

array([[1],
       [2],
       [3]])

In [122]:
# column vector via newaxis
x = np.array([1, 2, 3])
x=x[:, np.newaxis]
print(x.shape)

(3, 1)


In [129]:
# two leading axes of length 1 via newaxis
x = np.array([1, 2, 3])
print(x.shape)
x=x[np.newaxis,np.newaxis,:]
print(x)
print(x.shape)

(3,)
[[[1 2 3]]]
(1, 1, 3)


You will come across such transformations throughout any data science problem.
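
The cells above produce arrays of shape `(3, 1)` and `(1, 1, 3)`. For completeness, a row vector of shape `(1, 3)` can be sketched the same two ways (the names `row_a` and `row_b` are illustrative):

```python
import numpy as np

x = np.array([1, 2, 3])

row_a = x.reshape((1, 3))   # row vector via reshape
row_b = x[np.newaxis, :]    # row vector via newaxis

print(row_a.shape, row_b.shape)   # (1, 3) (1, 3)
print(row_a)                      # [[1 2 3]]
```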


## Array Concatenation and Splitting

All of the preceding routines worked on single arrays. It's also possible to combine multiple arrays into one, and to conversely split a single array into multiple arrays. We'll take a look at those operations here.

### Concatenation of arrays

Concatenation, or joining of two arrays in NumPy, is primarily accomplished using the routines ``np.concatenate``, ``np.vstack``, and ``np.hstack``.
``np.concatenate`` takes a tuple or list of arrays as its first argument, as we can see here:

In [130]:
a = "Rohan"+"chopra"
print(a)

Rohanchopra


In [131]:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])

array([1, 2, 3, 3, 2, 1])

You can also concatenate more than two arrays at once:

In [132]:
z = [99, 99, 99]
print(np.concatenate([x, y, z]))

[ 1  2  3  3  2  1 99 99 99]


It can also be used for two-dimensional arrays:

In [133]:
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid.shape)

(2, 3)


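Concatenating `grid` with itself along the first axis stacks the rows on top of each other, giving the rows `1 2 3`, `4 5 6`, `1 2 3`, `4 5 6`: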

In [136]:
# concatenate along the first axis
new_grid = np.concatenate([grid, grid], axis=0)
print(new_grid)
new_grid.shape

[[1 2 3]
 [4 5 6]
 [1 2 3]
 [4 5 6]]


(4, 3)

In [137]:
# concatenate along the second axis (zero-indexed)
new_grid_col = np.concatenate([grid, grid], axis=1)
print(new_grid_col)


new_grid_col.shape

[[1 2 3 1 2 3]
 [4 5 6 4 5 6]]


(2, 6)

Numpy also provides many useful functions for performing computations on arrays, such as `min()`, `max()`, `sum()`, and others:

![](http://jalammar.github.io/images/numpy/numpy-matrix-aggregation-1.png)

In [138]:
data = np.array([[1, 2], [3, 4], [5, 6]])

print(np.max(data))  # Compute max of all elements; prints "6"
print(np.min(data))  # Compute min of all elements; prints "1"
print(np.sum(data))  # Compute sum of all elements; prints "21"

6
1
21


Not only can we aggregate all the values in a matrix using these functions, but we can also aggregate across the rows or columns by using the `axis` parameter:

![](http://jalammar.github.io/images/numpy/numpy-matrix-aggregation-4.png)

In [139]:
data = np.array([[1, 2], [5, 3], [4, 6]])

print(np.max(data, axis=0))  # Compute max of each column; prints "[5 6]"
print(np.max(data, axis=1))  # Compute max of each row; prints "[2 5 6]"

[5 6]
[2 5 6]
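
One way to read the `axis` argument is as the dimension that gets collapsed. A small sketch with `sum` on the same data (the names `col_sums` and `row_sums` are illustrative):

```python
import numpy as np

data = np.array([[1, 2], [5, 3], [4, 6]])   # shape (3, 2)

col_sums = data.sum(axis=0)   # collapse the rows: one value per column
row_sums = data.sum(axis=1)   # collapse the columns: one value per row

print(col_sums, col_sums.shape)   # [10 11] (2,)
print(row_sums, row_sums.shape)   # [ 3  8 10] (3,)
```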


In [None]:
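# np.where(condition, x, y) picks x where condition is True and y elsewhere,
# like an element-wise ternary: condition ? x : y (in Python: x if condition else y)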

In [141]:
a = np.arange(10)
print(a)
np.where(a < 5, a, 10*a)

[0 1 2 3 4 5 6 7 8 9]


array([ 0,  1,  2,  3,  4, 50, 60, 70, 80, 90])

Let's look at a practical example using the numpy attributes we just discussed 💻

## Practice Questions for numpy
1. Define two custom numpy arrays, say A and B. Generate two new numpy arrays by stacking A and B vertically and horizontally.
2. Find the common elements between A and B. [Hint: intersection of two sets]
3. Extract all numbers from A that are within a specific range, e.g., between 5 and 10. [Hint: `np.where()` or boolean masks might be useful]
4. Filter the rows of iris_2d that have petallength (3rd column) > 1.5 and sepallength (1st column) < 5.0
```
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])
```

In [None]:
## Optional Practice Question

#Find the mean of a numeric column grouped by a categorical column in a 2D numpy array

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')
names = ('sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'species')


numeric_column = iris[:, 1].astype('float')  # sepalwidth
grouping_column = iris[:, 4]  # species

output = []
"""Your code goes here"""

output

## Starting with Pandas

Pandas is a powerful and popular open-source Python library used for data manipulation (cleaning, filtering, sorting, reshaping, restructuring, aggregating, joining) and analysis. It provides data structures and functions designed to make working with structured (tabular) data easy and intuitive.

Check out the [documentation](https://pandas.pydata.org/docs/reference/index.html) as you code.

In [None]:
#we start from the very basics...import!

import pandas as pd
import numpy as np

The **DataFrame** is a two-dimensional labeled data structure with columns of potentially different types, similar to a spreadsheet or SQL table. It provides powerful indexing, slicing, and reshaping capabilities, making it easy to manipulate and analyze data.

## PART 1: Getting and Knowing your Data



In [None]:
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'

chipo = pd.read_csv(url, sep='\t')   # the file is tab-separated (.tsv)

### See the first 10 entries

In [None]:
chipo.head(10)  #Returns the first 10 rows

### Print the last elements of the data set.

In [None]:
chipo.tail()   #Returns the last 5 rows

### What is the number of observations in the dataset?

In [None]:
chipo.shape    # Returns a (rows, columns) tuple; chipo.shape[0] is the number of observations

### Another way

In [None]:
chipo.info()

### What is the number of columns in the dataset?

In [None]:
chipo.shape[1]

### What are the different columns in our dataset?

In [None]:
chipo.columns

In [None]:
chipo.index

### How many items were ordered in total?

In [None]:
total_items_orders = chipo.quantity.sum()
total_items_orders

### Check the item price type

In [None]:
chipo.item_price.dtype
# It is a python object

### How much was the revenue for the period in the dataset?

In [None]:
chipo['item_price'] = chipo['item_price'].str[1:]
chipo['item_price'] = pd.to_numeric(chipo['item_price'])

In [None]:
chipo

In [None]:
revenue = (chipo['quantity']* chipo['item_price'])
revenue = revenue.sum()

print('Revenue was: $' + str(np.round(revenue,2)))

### How many orders were made in the period?

In [None]:
orders = chipo.order_id.value_counts().count()     # value_counts: returns the frequency count of unique values in the 'order_id' column of the DataFrame chipo
orders

### How many different items are sold?


In [None]:
chipo.item_name.value_counts().count()
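
Counting distinct values can also be done with `nunique()`. The sketch below uses a hypothetical miniature table (`orders`) rather than the real dataset, so the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the orders table, for illustration only
orders = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3, 3],
    "item_name": ["Chips", "Izze", "Chicken Bowl", "Chips", "Chicken Bowl", "Chips"],
})

n_orders = orders["order_id"].nunique()                   # number of distinct orders
n_orders_alt = orders["order_id"].value_counts().count()  # same count, via value_counts
n_items = orders["item_name"].nunique()                   # number of distinct items

print(n_orders, n_orders_alt, n_items)   # 3 3 3
```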

## PART 2: Filtering and Sorting Data

### What is the price of an item? (Shown here for a single Chicken Bowl)


In [None]:
chipo[(chipo['item_name'] == 'Chicken Bowl') & (chipo['quantity'] == 1)]
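
The filter above combines two boolean conditions. A minimal sketch of the same pattern on a hypothetical table (the rows and prices are invented for illustration):

```python
import pandas as pd

# Hypothetical rows; prices are invented for illustration
orders = pd.DataFrame({
    "item_name": ["Chicken Bowl", "Chips", "Chicken Bowl", "Izze"],
    "quantity": [1, 1, 2, 1],
    "item_price": [8.75, 2.15, 17.50, 3.39],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than ==.
mask = (orders["item_name"] == "Chicken Bowl") & (orders["quantity"] == 1)
single_bowls = orders[mask]
print(single_bowls)
```

Python's `and` does not work here, because it tries to reduce each whole Series to a single True or False.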

### Sort by the name of the item

In [None]:
chipo.item_name.sort_values()

### OR

In [None]:
chipo.sort_values(by = "item_name")

### What was the quantity of the most expensive item ordered?

In [None]:
chipo.sort_values(by = "item_price", ascending = False).head(1)


### How many times was a Veggie Salad Bowl ordered?

In [None]:
chipo_salad = chipo[chipo.item_name == "Veggie Salad Bowl"]
len(chipo_salad)

### Trying a different dataset

In [None]:
drinks = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv')
drinks.head()

### Which continent drinks more beer on average?

In [None]:
drinks.groupby('continent').beer_servings.mean()

### For each continent print the statistics for wine consumption.

In [None]:
drinks.groupby('continent').wine_servings.describe()

### Print the mean, min and max values for spirit consumption.

In [None]:
drinks.groupby('continent').spirit_servings.agg(['mean', 'min', 'max'])
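
To see what `groupby` followed by an aggregation returns, here is a sketch on toy data. The values below are hypothetical and not taken from the drinks dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the drinks dataset
toy = pd.DataFrame({
    "continent": ["EU", "EU", "AS", "AS", "AS"],
    "beer_servings": [200, 300, 10, 20, 60],
})

# One aggregate per group: the result is a Series indexed by continent
means = toy.groupby("continent")["beer_servings"].mean()

# Several aggregates per group: the result is a DataFrame with one column each
summary = toy.groupby("continent")["beer_servings"].agg(["mean", "min", "max"])

print(means)
print(summary)
```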

### Trying some more functionality

In [None]:
csv_url = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/04_Apply/Students_Alcohol_Consumption/student-mat.csv'
df = pd.read_csv(csv_url)
stud_alcoh = df.loc[:, "school":"guardian"].copy()  # copy so later column assignments do not act on a slice of df
stud_alcoh.head()

In [None]:
capitalizer = lambda q: q.capitalize()  #A lambda function in Python is a small anonymous function that can have any number of arguments, but can only have one expression.
                                        #They are defined using the lambda keyword, followed by a list of arguments, a colon, and then the expression to be evaluated.
                                        # Lambda functions are often used when you need a simple function for a short period of time.

In [None]:
stud_alcoh['Mjob'].apply(capitalizer)   # returns a new Series; stud_alcoh is not modified
stud_alcoh['Fjob'].apply(capitalizer)
stud_alcoh.tail()                       # still shows the original values

In [None]:
stud_alcoh['Mjob'] = stud_alcoh['Mjob'].apply(capitalizer)
stud_alcoh['Fjob'] = stud_alcoh['Fjob'].apply(capitalizer)
stud_alcoh.tail()
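
For simple string operations like this one, pandas also offers the `.str` accessor, which avoids writing a lambda. A sketch on a hypothetical Series `jobs`:

```python
import pandas as pd

jobs = pd.Series(["teacher", "at_home", "health"])   # hypothetical values

via_apply = jobs.apply(lambda q: q.capitalize())   # calls the function once per element
via_str = jobs.str.capitalize()                    # built-in string accessor

print(via_apply.equals(via_str))   # True
print(jobs[0])                     # teacher (the original Series is unchanged)
```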

### Here, instead of just using existing data, we will create our own DataFrame and Series

**pd.Series** is a one-dimensional labeled array-like data structure in pandas. It can hold data of any type (integers, floats, strings, etc.) and is similar to a **one-dimensional NumPy array** or a Python list. However, unlike a NumPy array, a pd.Series can have *custom row labels*, which are referred to as the index.

In [None]:
a = pd.Series([1,2,3])
a

You can specify custom index labels for a pandas Series by passing a list of index labels to the index parameter when creating the Series.

In [None]:
data = [10, 20, 30]
custom_index = ['A', 'B', 'C']
s = pd.Series(data, index=custom_index)
s
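
Once a Series has custom labels, values can be looked up by label or by position. A short sketch using the same data:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["A", "B", "C"])

print(s["B"])        # 20, lookup by label
print(s.loc["C"])    # 30, explicit label-based lookup
print(s.iloc[0])     # 10, explicit position-based lookup

above_15 = s[s > 15]             # boolean filtering keeps the labels
print(above_15.index.tolist())   # ['B', 'C']
```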

In [None]:
print('Data passed as a list')
df_list = pd.DataFrame([['May1', 32], ['May2', 35], ['May3', 40], ['May4', 50]])
print(df_list)

In [None]:
print('Data passed as dictionary')
df_dict = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]},dtype = float)
print(df_dict)

In [None]:
# Rename columns

df_dict.rename(columns={'A': 'a'})

# inplace by default is false
# if inplace = True is not set then the changes are not made on the original df but only a temp df is made with changes
df_dict

In [None]:
# changes made for original df

df_dict.rename(columns={'A': 'a'}, inplace=True)
df_dict


In [None]:
# Reset column names
# Tip: remember to pass the entire list in this case

df_dict.columns = ['a', 'b']
df_dict.head()

In [None]:
# Defining columns, index during dataframe creation

df_temp = pd.DataFrame(
    [['October 1', 67], ['October 2', 72], ['October 3', 58], ['October 4', 69], ['October 5', 77]],
    index=['Day 1', 'Day 2', 'Day 3', 'Day 4', 'Day 5'],
    columns=['Month', 'Temperature'],
)
df_temp


## Practice Questions for Pandas

1. From df filter the 'Manufacturer', 'Model' and 'Type' for every 20th row starting from 1st (row 0).

```
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
```

2. Replace missing values in Min.Price and Max.Price columns with their respective mean.

```
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
```

3. How to get the rows of a dataframe with row sum > 100?

```
df = pd.DataFrame(np.random.randint(10, 40, 60).reshape(-1, 4))
```