<a href="https://colab.research.google.com/github/venkateswaran-online/Scaler-Lecture-Notes/blob/main/AR_DAV1_Notes_Advanced_Numpy%2C_Pandas%2C_plotting_Lecture_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **1. Introduction to the Dataset & Business Context**

<table align="center" width="100%">
    <tr>
        <td width="35%">
            <img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/089/303/original/Cost-Estimation-to-develop-a-Restaurant-App-like-Zomato.jpg?1726057958">
        </td>
        <td>
            <div align="center">
                <font color="#e66e82" size="5">
                    <b>Bangalore Restaurant Market Analysis</b>
                </font>
            </div>
        </td>
    </tr>
</table>

### About the Dataset and Business Case

**Dataset:** <font color="violet">**Zomato Bangalore**</font> 🍽  
We are exploring restaurant data from **Zomato**, focusing on <font color="blue">Bangalore</font>. By analyzing this data, we can understand:

- Distribution of <font color="green">**online ordering**</font> & <font color="green">**table bookings**</font>.
- Patterns in <font color="purple">**ratings**</font>, <font color="orange">**votes**</font>, and <font color="brown">**cost**</font>.
- Popular <font color="red">**cuisines**</font> and their impact on business strategies.

**Key Business Objectives:**
1. <font color="teal">**Optimize Offerings & Online Presence**</font>: Identify patterns in service availability.
2. <font color="teal">**Enhance Customer Experience**</font>: Uncover top-rated areas and improve satisfaction.
3. <font color="teal">**Increase Revenue**</font>: Target locations with high votes and premium pricing segments.
4. <font color="teal">**Leverage Popular Cuisines**</font>: Focus on in-demand cuisines for strategic planning.

We’ll use this dataset as a running example to demonstrate techniques in data analysis and, later, how <font color="magenta">**NumPy**</font> can help us handle numerical data efficiently.


For Students: Open this Sheet to take a look at the Data [link](https://docs.google.com/spreadsheets/d/1zcVsqjSWrY67_X7kbOIMIenuNVkoLyEeEz7NNly-WrY)

<font color="blue">**Note:**</font> Here is our initial look at the data. We see columns like:
- <font color="orange">`votes`</font> (customer feedback intensity)
- <font color="purple">`rate`</font> (restaurant ratings)
- <font color="red">`cuisines`</font> & <font color="brown">`approx_cost(for two people)`</font> (pricing info)

This gives us context for the analyses we’ll perform.

---

## **2. Introduction to Data Analysis & Visualization (DAV)**

### What is DAV? 💡

**Data Analysis & Visualization (DAV)** is about using tools like <font color="green">NumPy</font>, <font color="green">Pandas</font>, <font color="green">Matplotlib</font>, and <font color="green">Seaborn</font> to understand, process, and present data meaningfully.

### **Q&A on Zomato Dataset Insights:**

1. **Q:** What is the overall average rating?  
   **Answer:** 3.9

2. **Q:** How many restaurants are in the dataset?  
   **Answer:** 3159

3. **Q:** How many restaurants offer online ordering vs. those that do not?  
   **Answer:**  
   **Online Ordering Availability:**
   ```
   online_order
   Yes    16358
   No      6835
   ```

4. **Q:** Which areas have the highest-rated restaurants?  
   **(Top 10 Locations by Average Rating)**  
   **Answer:**  
   ![Highest Rated Restaurants](https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/100/467/original/highest-rated_restaurants.png?1734428912)

5. **Q:** How does the average cost for two people vary by location?  
   **Answer:**  
   ![Average Cost for Two People](https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/100/468/original/average_cost_for_two_people.png?1734428927)

6. **Q:** Which cuisines are more popular based on votes and ratings?  
   **Answer:**  
   ![Votes and Ratings](https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/100/469/original/votes_and_ratings.png?1734428939)

**These are the kinds of questions we aim to address throughout the DAV module.** By the end of this module, you’ll be able to use NumPy for efficient computations, Pandas for data manipulation, and Matplotlib/Seaborn for visual insights—empowering you to answer practical, business-driven questions like the ones above.

For **Zomato**, DAV helps us:
- Identify top-rated restaurants by area or cuisine.
- Understand how pricing influences customer decisions.
- Make data-driven improvements to restaurant offerings and marketing strategies.

As we proceed, we’ll see how each tool and technique contributes to answering these questions and making sense of the data.

In [None]:
import numpy as np

# Extracting numeric data as NumPy arrays for future analysis
votes = np.array([ 775,  787,  918,   88,  166,  286, 2556,  324,  504,  402])
costs = np.array(["'800.0'" ,"'800.0'", "'800.0'", "'300.0'", "'600.0'", "'600.0'", "'600.0'", "'700.0'" ,"'550.0'", "'500.0'"])

print("Votes (Array):", votes)
print("Costs (Array):", costs)

Votes (Array): [ 775  787  918   88  166  286 2556  324  504  402]
Costs (Array): ["'800.0'" "'800.0'" "'800.0'" "'300.0'" "'600.0'" "'600.0'" "'600.0'"
 "'700.0'" "'550.0'" "'500.0'"]


The <font color="magenta">NumPy array</font> `votes` is now ready for fast calculations, statistics, and transformations. The `costs` array still holds strings, so it needs cleaning first (see Section 5). We’ll soon see why arrays are central to efficient analysis.




---


## **3. Python Lists vs NumPy Arrays**

### Why Use NumPy Arrays? 🚀

<font color="magenta">**NumPy**</font> is designed for:
- **Speed & Performance**: Vectorized operations run in optimized C code under the hood.
- **Memory Efficiency**: Contiguous data storage means less overhead, better cache utilization.
- **Vectorized Operations**: Apply an operation to all elements without explicit loops.
- **Time Complexity Advantages**: Bulk operations are straightforward and often faster than pure Python loops.


<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/063/995/original/download.png?1706870327" width=700 height=175>


In contrast, <font color="red">**Python lists**</font>:
- Are flexible but slower for large numeric computations.
- Lack built-in methods for fast vectorized math.

**Key Terms:**
- <font color="blue">**Performance**</font>
- <font color="blue">**Memory Efficiency**</font>
- <font color="blue">**Vectorization**</font>
- <font color="blue">**Scalability**</font>


**Example:** Compare the time taken to sum one million numbers stored in a Python list versus a NumPy array.


In [None]:
import time
import numpy as np

# Large list and array to compare performance
large_list = list(range(1_000_000))
large_array = np.arange(1_000_000)

# Timing sum on a Python list (perf_counter is a high-resolution timer meant for measuring durations)
start = time.perf_counter()
sum_list = sum(large_list)
end = time.perf_counter()
list_time = end - start

# Timing sum on a NumPy array
start = time.perf_counter()
sum_array = np.sum(large_array)
end = time.perf_counter()
array_time = end - start

print("List sum time:", list_time, "seconds")
print("Array sum time:", array_time, "seconds")

List sum time: 0.006956338882446289 seconds
Array sum time: 0.0009212493896484375 seconds


In this run the NumPy sum was roughly 7 times faster (about 0.0009 s versus 0.0070 s); exact timings vary by machine and run. Even basic operations can be faster with <font color="magenta">**NumPy arrays**</font>.

Another example: scaling all `votes` by 2 with NumPy.

In [None]:
votes_times_two = votes * 2
votes_times_two

array([1550, 1574, 1836,  176,  332,  572, 5112,  648, 1008,  804])

No loops needed. Just a simple expression!
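To see what the vectorized expression replaces, the sketch below doubles the same values with a plain Python list comprehension and checks that both approaches agree. The names `votes_list`, `doubled_list`, and `doubled_array` are illustrative and not part of the notebook.

```python
import numpy as np

votes = np.array([775, 787, 918, 88, 166, 286, 2556, 324, 504, 402])

# Pure-Python approach: an explicit loop (here a list comprehension)
votes_list = votes.tolist()
doubled_list = [v * 2 for v in votes_list]

# NumPy approach: one vectorized expression, no explicit loop
doubled_array = votes * 2

# Both approaches produce the same values
print(doubled_list == doubled_array.tolist())
```

Note that `votes_list * 2` would not double the values: for a Python list, `*` repeats the list instead of multiplying its elements.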

As we proceed, we’ll dive deeper into array operations, dimensions, indexing, slicing, and aggregations—all made smoother, faster, and more intuitive thanks to **NumPy**.



---



## **4. Dimensions & Shape**

### Understanding Dimensions & Shape 🔍

**NumPy arrays** can represent data in multiple dimensions:
- **1D arrays**: Like a simple list of votes.
- **2D arrays**: Think rows (restaurants) and columns (attributes).
- **nD arrays**: Higher dimensions for more complex data.

We use:
- <font color="magenta">`.shape`</font> to see (rows, columns)
- <font color="magenta">`.ndim`</font> to see how many dimensions
- <font color="magenta">`.size`</font> to count total elements

Let’s inspect the dimensions of our `votes` and `costs` arrays.


In [None]:
print("Votes array shape:", votes.shape)
print("Votes array dimensions:", votes.ndim)
print("Votes array size:", votes.size)

print("Costs array shape:", costs.shape)
print("Costs array dimensions:", costs.ndim)
print("Costs array size:", costs.size)

Votes array shape: (10,)
Votes array dimensions: 1
Votes array size: 10
Costs array shape: (10,)
Costs array dimensions: 1
Costs array size: 10


- These are both 1D arrays.

Now, let’s create a **2D array** example using a small portion of `votes` and `costs`.

In [None]:
# Use all 10 elements of votes and costs
subset_votes = votes
subset_costs = costs

# Create a 2D array: 10 rows, 2 columns (each row: [vote_count, cost])
two_d_data = np.array([
    subset_votes,
    subset_costs
]).T  # transpose so that each row corresponds to a single restaurant

print("2D Array:\n", two_d_data)
print("Shape:", two_d_data.shape)
print("Dimensions:", two_d_data.ndim)
print("Size:", two_d_data.size)

2D Array:
 [['775' "'800.0'"]
 ['787' "'800.0'"]
 ['918' "'800.0'"]
 ['88' "'300.0'"]
 ['166' "'600.0'"]
 ['286' "'600.0'"]
 ['2556' "'600.0'"]
 ['324' "'700.0'"]
 ['504' "'550.0'"]
 ['402' "'500.0'"]]
Shape: (10, 2)
Dimensions: 2
Size: 20


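We now have a 2D array where each row is a restaurant (limited sample), and columns represent different attributes (votes, cost). Because `costs` still holds strings here, NumPy stored the vote counts as strings too: an array has a single data type for all its elements.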

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/064/921/original/download_%284%29.png?1707852012">

We can even imagine a 3D array, but we’ll keep it simple for now. The key idea: dimensions define how data is structured for analysis.
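As a rough illustration of a third dimension, the sketch below stacks two small `[votes, cost]` tables into one 3D array. The "one table per month" framing and the values in `month_2` are made up for this example and do not come from the dataset.

```python
import numpy as np

# Two hypothetical 3x2 tables of [votes, cost], e.g. one per month
month_1 = np.array([[775, 800], [787, 800], [918, 800]])
month_2 = np.array([[800, 800], [790, 850], [950, 800]])  # made-up values

# Stack along a new leading axis -> shape (2, 3, 2)
three_d_data = np.stack([month_1, month_2])

print("Shape:", three_d_data.shape)
print("Dimensions:", three_d_data.ndim)
print("Size:", three_d_data.size)
```

Here the shape reads as (tables, rows, columns), and `.size` is the product of the three.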


### **`np.arange()`**

Let's create some sequences in NumPy.

We can pass a **starting** point, an **ending** point (not included in the array), and a **step size**.

**Syntax:**
- `arange(start, end, step)`



In [None]:
arr2 = np.arange(1, 5)
arr2

array([1, 2, 3, 4])

In [None]:
arr2_step = np.arange(1, 5, 2)
arr2_step

array([1, 3])

`np.arange()` behaves much like Python's built-in `range()` function.

**But then why not call it np.range?**

- In `np.arange()`, we can pass a **floating point number** as **step-size**.

In [None]:
arr3 = np.arange(1, 5, 0.5)
arr3

array([1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
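The difference between the two functions can be checked directly. The sketch below first confirms that integer arguments give the same values, then shows that `range()` rejects a float step while `np.arange()` accepts it. The variable name `half_steps` is illustrative.

```python
import numpy as np

# With integer arguments, np.arange produces the same values as range()
print(np.arange(1, 5, 2).tolist() == list(range(1, 5, 2)))

# range() rejects a float step, while np.arange accepts it
try:
    range(1, 5, 0.5)
except TypeError as err:
    print("range() failed:", err)

half_steps = np.arange(1, 5, 0.5)
print(half_steps)
```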



---

## **5. Type Conversion in NumPy Arrays**


### Type Conversion ⚙️

In data analysis, numeric calculations require data to be in a proper numeric format.  
If a column is read as an "object" (often a string) but represents numeric values, we must convert it.

**Key points:**
- Use <font color="magenta">`.astype()`</font> to change data types.
- Ensuring a numeric data type allows for mathematical operations (e.g., mean, sum).
- Different data types (int, float) have implications for how values are stored and interpreted.
- Mixed types (e.g., strings and integers in the same array) can cause issues or forced conversions.

**Real-World Example:**
- The `approx_cost(for two people)` column may appear as strings like `"1,200"` instead of a numeric value. We need to convert it to `int` or `float` for calculations.

In [None]:
# STEP 1: Inspect the current dtype of 'approx_cost(for two people)'
print("Column dtype before conversion:", costs.dtype)

Column dtype before conversion: <U7


In [None]:
costs

array(["'800.0'", "'800.0'", "'800.0'", "'300.0'", "'600.0'", "'600.0'",
       "'600.0'", "'700.0'", "'550.0'", "'500.0'"], dtype='<U7')

As you can see, each value is wrapped in extra single quotes, and in the full dataset some values also contain "," (e.g. `"1,200"`). Both must be removed before converting to float, otherwise the conversion raises an error.

In [None]:
# STEP 2: Remove commas and single quotes
costs = np.char.replace(costs, ',', '')  # Remove commas; np.char.replace works element-wise, like Python's str.replace
costs = np.char.replace(costs, "'", '')  # Remove single quotes

# STEP 3: Convert the cleaned strings to float
costs = costs.astype(float)

# STEP 4: Confirm the dtype
print("Array dtype after conversion:", costs.dtype)
print("Cleaned costs:", costs)

Array dtype after conversion: float64
Cleaned costs: [800. 800. 800. 300. 600. 600. 600. 700. 550. 500.]


- We started with values stored as strings (what Pandas would report as "object").
- Removed commas and stray quotes so that a value like "1,200" becomes "1200".
- Converted the cleaned strings to a float type.
- Now `costs` is a proper `float64` array, ready for numeric analysis.
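The cleaning steps can be wrapped into one reusable function. This is a minimal sketch: `clean_cost_strings` is a hypothetical helper name, and the sample values (including `"1,200"`) are made up to exercise both kinds of noise.

```python
import numpy as np

def clean_cost_strings(raw_costs):
    """Strip commas and single quotes, then convert to float (hypothetical helper)."""
    cleaned = np.char.replace(raw_costs, ",", "")
    cleaned = np.char.replace(cleaned, "'", "")
    return cleaned.astype(float)

# Made-up sample mixing both kinds of noise
raw = np.array(["'800.0'", "1,200", "'550.0'"])
clean = clean_cost_strings(raw)
print(clean, clean.dtype)
```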


In [None]:
# Convert costs to float
costs_float = costs.astype(float)

print("Costs as float, dtype:", costs_float.dtype)
print("First 5 costs (float):", costs_float[:5])

Costs as float, dtype: float64
First 5 costs (float): [800. 800. 800. 300. 600.]


- Now all values are floating-point numbers, which is useful for operations that require decimals or non-integers.

**What if the data were mixed types?**
For example, consider this small array:


In [None]:
mixed_data = np.array([100, '200', 300, 'Hundred'])
print("Mixed data array:", mixed_data)
print("Mixed data dtype:", mixed_data.dtype)

Mixed data array: ['100' '200' '300' 'Hundred']
Mixed data dtype: <U21


- Notice that if an array contains at least one string, NumPy will often interpret the entire array as a string type (`<U21` means a Unicode string of up to 21 characters).
- To do math, we must convert these strings to numeric types.
- If conversion fails (e.g., a non-numeric string), it will raise an error.

**Example: Converting mixed_data to int:**


In [None]:
# Attempt to convert the string values to int (this raises an error because of 'Hundred')
mixed_int = mixed_data.astype(int)
print("Converted mixed data dtype:", mixed_int.dtype)
print("Mixed data as int:", mixed_int)

ValueError: invalid literal for int() with base 10: 'Hundred'

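- The conversion fails: `'Hundred'` is not a valid integer literal, so `.astype(int)` raises a `ValueError` and no integer array is produced. Non-numeric entries must be cleaned or removed before converting.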

**Key Takeaways:**
- Always ensure the data type matches the intended operation.
- Use `.astype()` to convert strings to integers/floats.
- Clean the data first (e.g., remove commas).
- Mixed-type arrays become string arrays, so convert them if you need numeric calculations.
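When an array may contain entries that are not numbers at all, one possible approach is to convert element by element and substitute `NaN` where conversion fails. The sketch below uses an explicit loop for clarity; `to_float_or_nan` is a hypothetical helper, not a NumPy function.

```python
import numpy as np

def to_float_or_nan(values):
    """Convert each element to float; use NaN where conversion fails (hypothetical helper)."""
    result = np.empty(len(values), dtype=float)
    for i, value in enumerate(values):
        try:
            result[i] = float(value)
        except ValueError:
            result[i] = np.nan
    return result

mixed_data = np.array([100, '200', 300, 'Hundred'])
converted = to_float_or_nan(mixed_data)
print(converted)

# np.nanmean ignores the NaN placeholder when averaging
print(np.nanmean(converted))
```

Whether to replace bad entries with `NaN` or drop them depends on the analysis; the point is only that the array ends up with a numeric dtype.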

With properly typed arrays, we can perform statistical calculations, filtering, and aggregations without issues.

---

## **6. Indexing & Slicing**

### Indexing & Slicing 🎯

**Indexing** lets us select specific elements, while **slicing** extracts subsets.

- **Basic indexing**: `array[index]`
- **Slicing**: `array[start:end]`

Examples:
1. First 5 votes.
2. A slice of the first 10 costs.
3. For our 2D array, select the first 3 rows.


In [None]:
votes[-1] # negative indexing in numpy array

402

In [None]:
votes[0] # gives first element of array

775

You can also index with a list of indices in NumPy.

In [None]:
votes[[2,3,4,1,2,2]]

array([918,  88, 166, 787, 918, 918])

In [None]:
print("First 5 votes:", votes[:5])
print("First 10 costs:", costs[:10])

# 2D array slicing (two_d_data from above)
print("First 3 rows of the 2D array:\n", two_d_data[:3, :])

First 5 votes: [775 787 918  88 166]
First 10 costs: [800. 800. 800. 300. 600. 600. 600. 700. 550. 500.]
First 3 rows of the 2D array:
 [['775' "'800.0'"]
 ['787' "'800.0'"]
 ['918' "'800.0'"]]


- We can also slice columns. For example, `two_d_data[:, 0]` gives all votes, `two_d_data[:, 1]` gives all costs.
- Indexing & slicing help us focus on relevant subsets, like top-rated restaurants or certain cost ranges.

### Fancy Indexing (Masking)

- NumPy arrays can be indexed with boolean arrays (masks).
- This method is called **masking** (boolean indexing) and is often grouped under **fancy indexing**.

\
What would happen if we do this?


In [None]:
ten_votes = votes[:10]

print(ten_votes)

votes[:10] < 800

[ 775  787  918   88  166  286 2556  324  504  402]


array([ True,  True, False,  True,  True,  True, False,  True,  True,
        True])

**The comparison is also applied to each element**.
- Every value less than 800 returns `True`
- Every value of 800 or more returns `False`



In [None]:
ten_votes[[True,  True,  True,  True,  True, False, False, False, False, False]]

array([775, 787, 918,  88, 166])

Notice that we are passing a list of booleans, one per element.
- For every `True`, the element at the corresponding position is returned.
- Conversely, for every `False`, the element at the corresponding position is skipped.

So, this becomes a **filter** of sorts.

Now, let's use this to filter or mask values from our array.

**Condition will be passed instead of indices and slice ranges.**

In [None]:
ten_votes[ten_votes < 700]

array([ 88, 166, 286, 324, 504, 402])

This is known as fancy indexing (boolean masking) in NumPy.

In [None]:
ten_votes[ten_votes%3 == 0]

array([ 918, 2556,  324,  504,  402])
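Conditions can also be combined. The sketch below uses the element-wise operators `&` (and), `|` (or), and `~` (not) on the same ten vote counts; the thresholds and variable names are chosen only for illustration.

```python
import numpy as np

votes = np.array([775, 787, 918, 88, 166, 286, 2556, 324, 504, 402])

# & means "and", | means "or"; each condition needs its own parentheses
mid_range = votes[(votes >= 300) & (votes <= 800)]
extremes = votes[(votes < 100) | (votes > 2000)]

# ~ negates a mask: everything that is NOT a multiple of 3
not_multiple_of_3 = votes[~(votes % 3 == 0)]

print(mid_range)
print(extremes)
print(not_multiple_of_3)
```

Python's `and` / `or` keywords do not work element-wise on arrays, which is why the symbol operators are used here.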

---

## **7. Working with 2D Arrays (Matrices)**

### Working with 2D Arrays (Matrices) 🧩

**2D arrays** represent tabular data, like a matrix of restaurants (rows) and attributes (columns).

We already have numeric arrays like `votes` and `costs`. Let’s create a small **2D array** using a subset of these columns. Each row represents one restaurant, and each column a numeric attribute.


In [None]:
# Take a sample of 50 restaurants
sample_votes = np.array([775, 787, 918, 88, 166, 286, 2556, 324, 504, 402, 150, 164, 424, 918, 90, 133, 144, 93, 62, 180, 62, 148, 219, 506, 172, 415, 230, 1647, 4884, 133, 286, 540, 2556, 36, 244, 804, 679, 245, 345, 618, 1047, 627, 354, 244, 163, 808, 1720, 868, 520, 299])
sample_costs = np.array([800.0, 800.0, 800.0, 300.0, 600.0, 600.0, 600.0, 700.0, 550.0, 500.0, 600.0, 500.0, 450.0, 800.0, 650.0, 800.0, 700.0, 300.0, 400.0, 500.0, 600.0, 550.0, 600.0, 500.0, 750.0, 500.0, 650.0, 600.0, 750.0, 200.0, 500.0, 800.0, 600.0, 400.0, 300.0, 450.0, 850.0, 300.0, 400.0, 750.0, 450.0, 450.0, 800.0, 800.0, 800.0, 850.0, 400.0, 1200.0, 300.0, 300.0])

# Create a 2D array: rows = restaurants, columns = [votes, costs]
restaurants_data = np.column_stack((sample_votes, sample_costs))

print("2D Array (votes, costs):\n", restaurants_data)
print("Shape:", restaurants_data.shape)
print("Dimensions:", restaurants_data.ndim)  # 2D

2D Array (votes, costs):
 [[ 775.  800.]
 [ 787.  800.]
 [ 918.  800.]
 [  88.  300.]
 [ 166.  600.]
 [ 286.  600.]
 [2556.  600.]
 [ 324.  700.]
 [ 504.  550.]
 [ 402.  500.]
 [ 150.  600.]
 [ 164.  500.]
 [ 424.  450.]
 [ 918.  800.]
 [  90.  650.]
 [ 133.  800.]
 [ 144.  700.]
 [  93.  300.]
 [  62.  400.]
 [ 180.  500.]
 [  62.  600.]
 [ 148.  550.]
 [ 219.  600.]
 [ 506.  500.]
 [ 172.  750.]
 [ 415.  500.]
 [ 230.  650.]
 [1647.  600.]
 [4884.  750.]
 [ 133.  200.]
 [ 286.  500.]
 [ 540.  800.]
 [2556.  600.]
 [  36.  400.]
 [ 244.  300.]
 [ 804.  450.]
 [ 679.  850.]
 [ 245.  300.]
 [ 345.  400.]
 [ 618.  750.]
 [1047.  450.]
 [ 627.  450.]
 [ 354.  800.]
 [ 244.  800.]
 [ 163.  800.]
 [ 808.  850.]
 [1720.  400.]
 [ 868. 1200.]
 [ 520.  300.]
 [ 299.  300.]]
Shape: (50, 2)
Dimensions: 2


#### How can we change the shape of this array?

- Using `reshape()`

For a 2D result, we specify two arguments:
- **First argument** is the **number of rows**
- **Second argument** is the **number of columns**

\
Let's try converting it into a `25x4` array.

In [None]:
restaurants_data.reshape(25, 4)

restaurants_data

array([[ 775.,  800.],
       [ 787.,  800.],
       [ 918.,  800.],
       [  88.,  300.],
       [ 166.,  600.],
       [ 286.,  600.],
       [2556.,  600.],
       [ 324.,  700.],
       [ 504.,  550.],
       [ 402.,  500.],
       [ 150.,  600.],
       [ 164.,  500.],
       [ 424.,  450.],
       [ 918.,  800.],
       [  90.,  650.],
       [ 133.,  800.],
       [ 144.,  700.],
       [  93.,  300.],
       [  62.,  400.],
       [ 180.,  500.],
       [  62.,  600.],
       [ 148.,  550.],
       [ 219.,  600.],
       [ 506.,  500.],
       [ 172.,  750.],
       [ 415.,  500.],
       [ 230.,  650.],
       [1647.,  600.],
       [4884.,  750.],
       [ 133.,  200.],
       [ 286.,  500.],
       [ 540.,  800.],
       [2556.,  600.],
       [  36.,  400.],
       [ 244.,  300.],
       [ 804.,  450.],
       [ 679.,  850.],
       [ 245.,  300.],
       [ 345.,  400.],
       [ 618.,  750.],
       [1047.,  450.],
       [ 627.,  450.],
       [ 354.,  800.],
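       [ 244.,  800.],
       [ 163.,  800.],
       [ 808.,  850.],
       [1720.,  400.],
       [ 868., 1200.],
       [ 520.,  300.],
       [ 299.,  300.]])

Notice that `restaurants_data` itself still has shape `(50, 2)`: `reshape()` returns a new array and does not modify the original in place.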

In [None]:
restaurants_data.reshape(10, 10)

array([[ 775.,  800.,  787.,  800.,  918.,  800.,   88.,  300.,  166.,
         600.],
       [ 286.,  600., 2556.,  600.,  324.,  700.,  504.,  550.,  402.,
         500.],
       [ 150.,  600.,  164.,  500.,  424.,  450.,  918.,  800.,   90.,
         650.],
       [ 133.,  800.,  144.,  700.,   93.,  300.,   62.,  400.,  180.,
         500.],
       [  62.,  600.,  148.,  550.,  219.,  600.,  506.,  500.,  172.,
         750.],
       [ 415.,  500.,  230.,  650., 1647.,  600., 4884.,  750.,  133.,
         200.],
       [ 286.,  500.,  540.,  800., 2556.,  600.,   36.,  400.,  244.,
         300.],
       [ 804.,  450.,  679.,  850.,  245.,  300.,  345.,  400.,  618.,
         750.],
       [1047.,  450.,  627.,  450.,  354.,  800.,  244.,  800.,  163.,
         800.],
       [ 808.,  850., 1720.,  400.,  868., 1200.,  520.,  300.,  299.,
         300.]])

In [None]:
restaurants_data.reshape(50, 4)

ValueError: cannot reshape array of size 100 into shape (50,4)

**This will give an Error. Why?**

* We have 100 elements in `restaurants_data`, but `reshape(50, 4)` is trying to fill in `50x4 = 200` elements.
* Therefore, whatever the shape we're trying to reshape to, must be able to incorporate the number of elements that we have.

In [None]:
restaurants_data.reshape(5, -1)

array([[ 775.,  800.,  787.,  800.,  918.,  800.,   88.,  300.,  166.,
         600.,  286.,  600., 2556.,  600.,  324.,  700.,  504.,  550.,
         402.,  500.],
       [ 150.,  600.,  164.,  500.,  424.,  450.,  918.,  800.,   90.,
         650.,  133.,  800.,  144.,  700.,   93.,  300.,   62.,  400.,
         180.,  500.],
       [  62.,  600.,  148.,  550.,  219.,  600.,  506.,  500.,  172.,
         750.,  415.,  500.,  230.,  650., 1647.,  600., 4884.,  750.,
         133.,  200.],
       [ 286.,  500.,  540.,  800., 2556.,  600.,   36.,  400.,  244.,
         300.,  804.,  450.,  679.,  850.,  245.,  300.,  345.,  400.,
         618.,  750.],
       [1047.,  450.,  627.,  450.,  354.,  800.,  244.,  800.,  163.,
         800.,  808.,  850., 1720.,  400.,  868., 1200.,  520.,  300.,
         299.,  300.]])

Notice that NumPy automatically worked out the value to use in place of `-1`: given that the first argument is `5` and the array has 100 elements, the second dimension must be `20`.

We can also put `-1` as the first argument. As long as one argument is given, it will calculate the other one.
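The sketch below checks both uses of `-1` on a stand-in array with the same number of elements (100). The name `hundred_values` is illustrative; any array of size 100 behaves the same way.

```python
import numpy as np

hundred_values = np.arange(100)  # 100 elements, the same size as restaurants_data

# NumPy infers the dimension marked -1 from the total size
print(hundred_values.reshape(5, -1).shape)   # 100 / 5 = 20 columns
print(hundred_values.reshape(-1, 4).shape)   # 100 / 4 = 25 rows

# A target shape is valid only if the element counts match
rows, cols = 50, 4
print(rows * cols == hundred_values.size)
```

Only one dimension can be left as `-1`, since NumPy needs the others to work out the missing one.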

**Transpose:**  
- Use <font color="magenta">`.T`</font> to flip rows and columns.

**Indexing & Slicing in 2D:**  
- `restaurants_data[0, :]` = first row (all columns)
- `restaurants_data[:, 1]` = second column (all rows)
- `restaurants_data[2:5, :]` = rows at index 2 to 4 (the end index 5 is excluded)

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/054/693/original/2dnp.png?1697949471 height = "600" width = "700">

In [None]:
print("Transpose of the 2D array:\n", restaurants_data.T)
print("First row:", restaurants_data[0, :])
print("All costs (second column):", restaurants_data[:, 1])
print("Rows 2 to 4:\n", restaurants_data[2:5, :])

Transpose of the 2D array:
 [[ 775.  787.  918.   88.  166.  286. 2556.  324.  504.  402.  150.  164.
   424.  918.   90.  133.  144.   93.   62.  180.   62.  148.  219.  506.
   172.  415.  230. 1647. 4884.  133.  286.  540. 2556.   36.  244.  804.
   679.  245.  345.  618. 1047.  627.  354.  244.  163.  808. 1720.  868.
   520.  299.]
 [ 800.  800.  800.  300.  600.  600.  600.  700.  550.  500.  600.  500.
   450.  800.  650.  800.  700.  300.  400.  500.  600.  550.  600.  500.
   750.  500.  650.  600.  750.  200.  500.  800.  600.  400.  300.  450.
   850.  300.  400.  750.  450.  450.  800.  800.  800.  850.  400. 1200.
   300.  300.]]
First row: [775. 800.]
All costs (second column): [ 800.  800.  800.  300.  600.  600.  600.  700.  550.  500.  600.  500.
  450.  800.  650.  800.  700.  300.  400.  500.  600.  550.  600.  500.
  750.  500.  650.  600.  750.  200.  500.  800.  600.  400.  300.  450.
  850.  300.  400.  750.  450.  450.  800.  800.  800.  850.  400. 1200.
  300.  300.]
Rows 2 to 4:
 [[918. 800.]
 [ 88. 300.]
 [166. 600.]]

**Fancy Indexing (Masking):**  
- Filter rows based on a condition. For example, restaurants with costs > 500.


In [None]:
mask_high_cost = restaurants_data[:, 1] > 500
high_cost_restaurants = restaurants_data[mask_high_cost]

print("High cost restaurants (cost > 500):\n", high_cost_restaurants)

High cost restaurants (cost > 500):
 [[ 775.  800.]
 [ 787.  800.]
 [ 918.  800.]
 [ 166.  600.]
 [ 286.  600.]
 [2556.  600.]
 [ 324.  700.]
 [ 504.  550.]
 [ 150.  600.]
 [ 918.  800.]
 [  90.  650.]
 [ 133.  800.]
 [ 144.  700.]
 [  62.  600.]
 [ 148.  550.]
 [ 219.  600.]
 [ 172.  750.]
 [ 230.  650.]
 [1647.  600.]
 [4884.  750.]
 [ 540.  800.]
 [2556.  600.]
 [ 679.  850.]
 [ 618.  750.]
 [ 354.  800.]
 [ 244.  800.]
 [ 163.  800.]
 [ 808.  850.]
 [ 868. 1200.]]


- We can easily select subsets based on conditions, enabling targeted analysis.
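A mask built from one column can also be used to pick values from another column. The sketch below uses the first five rows of the sample; the array name `restaurants` and the mask name `is_high_cost` are illustrative.

```python
import numpy as np

# First five rows of the sample: columns are [votes, cost]
restaurants = np.array([
    [775.0, 800.0],
    [787.0, 800.0],
    [918.0, 800.0],
    [88.0, 300.0],
    [166.0, 600.0],
])

# Mask built from the cost column ...
is_high_cost = restaurants[:, 1] > 500

# ... applied to the votes column
votes_high_cost = restaurants[is_high_cost, 0]
votes_low_cost = restaurants[~is_high_cost, 0]

print(votes_high_cost)
print(votes_low_cost)
```

This pattern answers questions such as "how many votes do the more expensive restaurants receive?" without writing a loop.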




---

## **8. Aggregate Functions**

**Aggregate functions** summarize data 🔢:
- `np.sum()` to total votes
- `np.mean()` to find average cost
- `np.min()/np.max()` to find min/max rating or cost
- `np.std()` to measure spread

We’ll use these on our numeric arrays to quickly derive insights.


In [None]:
# Total votes of these 50 restaurants
total_votes = np.sum(restaurants_data[:, 0])

# Average cost for these 50 restaurants
avg_cost = np.mean(restaurants_data[:, 1])

# Max votes, Min cost
max_votes = np.max(restaurants_data[:, 0])
min_cost = np.min(restaurants_data[:, 1])

print("Total Votes:", total_votes)
print("Average Cost:", avg_cost)
print("Max Votes:", max_votes)
print("Min Cost:", min_cost)

Total Votes: 30583.0
Average Cost: 587.0
Max Votes: 4884.0
Min Cost: 200.0


- Aggregates help us understand overall trends quickly.
- For the full dataset (not just the sample), we could apply the same functions to the entire `votes` or `costs` arrays.
- This is crucial for highlighting overall patterns (e.g., average spend in the city or total engagement through votes).

Now we can confidently summarize data, enabling high-level business insights.
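`np.std()` was listed above but not yet used. The sketch below applies it to the ten cleaned cost values from earlier, alongside the mean and the range; the variable names are illustrative.

```python
import numpy as np

costs = np.array([800., 800., 800., 300., 600., 600., 600., 700., 550., 500.])

mean_cost = np.mean(costs)
std_cost = np.std(costs)                    # population standard deviation (ddof=0)
cost_range = np.max(costs) - np.min(costs)  # gap between most and least expensive

print("Mean:", mean_cost)
print("Std:", std_cost)
print("Range:", cost_range)
```

By default `np.std()` divides by the number of elements; passing `ddof=1` gives the sample standard deviation instead.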


In [None]:
data = np.arange(12).reshape(3, 4)
data

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

#### What if we want to sum the elements row-wise or column-wise?

- By **setting `axis` parameter**

#### What will `np.sum(data, axis=0)` do?

- `np.sum(data, axis=0)` adds together values in **different rows**
- `axis = 0` $\rightarrow$ **Changes will happen along the vertical axis**
- Summation of values happens **in the vertical direction**, giving one total per column.
- Rows collapse/merge when we do `axis=0`.

In [None]:
np.sum(data, axis=0)

array([12, 15, 18, 21])

#### What if we specify `axis=1`?

- `np.sum(data, axis=1)` adds together values in **different columns**
- `axis = 1` $\rightarrow$ **Changes will happen along the horizontal axis**
- Summation of values happens **in the horizontal direction**, giving one total per row.
- Columns collapse/merge when we do `axis=1`.

In [None]:
np.sum(data, axis=1)

array([ 6, 22, 38])
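The `axis` parameter works the same way for other aggregates. The sketch below reuses the same 3x4 `data` array with `np.mean()` and `np.max()`; the result names are illustrative.

```python
import numpy as np

data = np.arange(12).reshape(3, 4)

col_means = np.mean(data, axis=0)   # rows collapse: one value per column
row_maxes = np.max(data, axis=1)    # columns collapse: one value per row
overall = np.sum(data)              # no axis: a single total

print(col_means)
print(row_maxes)
print(overall)
```

A quick check on shapes helps: with `axis=0` the result has as many entries as there are columns (4), and with `axis=1` as many as there are rows (3).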



---

## **9. Logical Operations**

**Logical operations** in NumPy help us filter and query data based on conditions. 🤔

- <font color="magenta">`np.where(condition)`</font>: Returns indices where the condition is True.
- <font color="magenta">`np.any(condition)`</font>: Checks if **any** elements satisfy a condition.
- <font color="magenta">`np.all(condition)`</font>: Checks if **all** elements satisfy a condition.

**Use Cases:**
- Find restaurants with certain attributes (e.g., cost above a threshold).
- Check if at least one restaurant meets a condition (`np.any()`).
- Check if all restaurants meet a certain standard (`np.all()`).


In [None]:
# Example: We have the 'costs' array from before
# Let's find indices where cost > 1000
high_cost_indices = np.where(costs > 1000)
print("Indices of restaurants with cost > 1000:", high_cost_indices)

Indices of restaurants with cost > 1000: (array([], dtype=int64),)


In [None]:
# Are there any restaurants with cost > 3000?
any_above_3000 = np.any(costs > 3000)
print("Any restaurant cost above 3000?", any_above_3000)

Any restaurant cost above 3000? False


In [None]:
# Are all restaurants cheaper than 5000?
all_below_5000 = np.all(costs < 5000)
print("All restaurants cost below 5000?", all_below_5000)

All restaurants cost below 5000? True


In [None]:
# Using np.where to directly select values
selected_costs = costs[np.where((costs > 500) & (costs < 1000))]
print("Costs between 500 and 1000:", selected_costs)

Costs between 500 and 1000: [800. 800. 800. 600. 600. 600. 700. 550.]


- `np.where()` gives us flexibility in selecting elements or their indices based on a condition.
- `np.any()` and `np.all()` quickly inform us about the existence or universality of a condition across the dataset.
- These tools are critical for filtering data before applying further analysis or visualization.

With logical operations, we can focus on the subsets of data that matter, speeding up our decision-making and insights discovery.
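To connect `np.where()` with the boolean masks from Section 6, the sketch below shows that indexing with the positions returned by `np.where()` selects the same values as indexing with the mask itself. It uses the ten cleaned cost values; `condition` and `indices` are illustrative names.

```python
import numpy as np

costs = np.array([800., 800., 800., 300., 600., 600., 600., 700., 550., 500.])

condition = (costs > 500) & (costs < 1000)

# np.where returns a tuple with one index array per dimension
indices = np.where(condition)[0]
print(indices)

# Indexing with the positions or with the mask selects the same values
print(np.array_equal(costs[indices], costs[condition]))

# np.any / np.all reduce the mask to a single boolean
print(np.any(condition), np.all(condition))
```

Use `np.where()` when the positions themselves are needed, for example to look up the same restaurants in another array; use the mask directly when only the values matter.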



---

