## ⚙️ Setup

- Ensure the Python kernel has the necessary libraries: `pandas`, `numpy`, `matplotlib`, `lets-plot` and `numerize` (the afternoon examples also import `statsmodels` and `plotly`).
- Ensure the `bakery.csv` file is in the `data` folder.

**Imports**

(It is good practice to import ALL the libraries you will be using at the start of your notebook.)

In [1]:
import numpy as np
import pandas as pd

from numerize import numerize as nz

from lets_plot import *
LetsPlot.setup_html()

# 1. Reading & tidying up the data a bit

In [None]:
filename = '../data/bakery.csv'
df = pd.read_csv(filename)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 485 entries, 0 to 484
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   data-product-id        485 non-null    int64 
 1   data-product-name      485 non-null    object
 2   data-product-type      485 non-null    object
 3   data-product-on-offer  485 non-null    bool  
 4   data-product-index     485 non-null    int64 
 5   image-url              485 non-null    object
 6   product-page           485 non-null    object
 7   product-name           485 non-null    object
 8   product-size           482 non-null    object
 9   item-price             485 non-null    object
 10  price-per-unit         463 non-null    object
 11  offer-description      52 non-null     object
 12  category               485 non-null    object
dtypes: bool(1), int64(2), object(10)
memory usage: 46.1+ KB


Dropping unnecessary columns and renaming the columns for better understanding:

In [3]:
# Drop duplicates
df = df.drop_duplicates()

df = df.drop(columns=['data-product-name', 
                      'data-product-type', 
                      'data-product-index', 
                      'category'])
df = (
    df.rename(columns={
        'data-product-id': 'id',
        'data-product-on-offer': 'offer',
        'product-page': 'page',
        'product-name': 'name',
        'product-size': 'size',
    })
)

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 482 entries, 0 to 484
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 482 non-null    int64 
 1   offer              482 non-null    bool  
 2   image-url          482 non-null    object
 3   page               482 non-null    object
 4   name               482 non-null    object
 5   size               479 non-null    object
 6   item-price         482 non-null    object
 7   price-per-unit     460 non-null    object
 8   offer-description  49 non-null     object
dtypes: bool(1), int64(1), object(7)
memory usage: 34.4+ KB
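A side note on `rename()`: by default it silently skips keys that do not match any column, so a typo in the mapping goes unnoticed. The sketch below, on a made-up two-column frame with a deliberately misspelled key (`product-nmae`), shows the default behaviour and how `errors='raise'` turns the typo into an error. The toy frame and the misspelling are assumptions for illustration only.

```python
import pandas as pd

# Toy frame (hypothetical data) with two of the original column names
toy = pd.DataFrame({'data-product-id': [1, 2], 'product-name': ['Loaf', 'Bun']})

# Default behaviour: the misspelled key 'product-nmae' is skipped without complaint
renamed = toy.rename(columns={'data-product-id': 'id', 'product-nmae': 'name'})
print(renamed.columns.tolist())

# With errors='raise', the same typo surfaces immediately as a KeyError
try:
    toy.rename(columns={'data-product-id': 'id', 'product-nmae': 'name'}, errors='raise')
    typo_caught = False
except KeyError:
    typo_caught = True
print(typo_caught)
```

Checking `df.info()` after a rename, as done above, is another way to catch this kind of slip.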


Changing types of columns:

In [None]:
df['id'] = df['id'].astype('int32')

## 1.1 Fixing the `item-price` column

The `item-price` column is stored as strings such as `£3.00` and `60p`. Before we can convert it to a number, we need to remove the `£` and `p` symbols and express pence as a fraction of a pound (`60p` becomes `0.60`).


Follow my Live Demo as I build the rationale for the solution you see below. 

In [5]:
# Flag pence prices (e.g. '60p') before stripping the symbols
is_pence = df['item-price'].str.endswith('p')
value = df['item-price'].str.replace('£', '').str.replace('p', '').astype('float')
# Pence are divided by 100, so '60p' becomes 0.60 and '5p' becomes 0.05
df['item-price'] = value.where(~is_pence, value / 100)
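To make the conversion rule concrete, here is a minimal sketch on a handful of made-up price strings. The helper name `to_pounds` and the sample values are illustrative assumptions, not part of the bakery dataset.

```python
import pandas as pd

def to_pounds(raw: str) -> float:
    """Convert a price string such as '£3.00' or '60p' to pounds."""
    raw = raw.strip()
    if raw.endswith('p'):
        # Pence: drop the trailing 'p' and divide by 100
        return float(raw[:-1]) / 100
    # Pounds: drop the currency symbol
    return float(raw.replace('£', ''))

prices = pd.Series(['£3.00', '60p', '5p', '£12.50'])
converted = prices.apply(to_pounds)
print(converted.tolist())
```

Dividing by 100 rather than prefixing `'0.'` matters for single-digit pence: `'5p'` should become 0.05, not 0.5.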

# 2. EDA by means of curiosity-driven questions

## Q1: What is the distribution of prices in the Waitrose Bakery section?

In [6]:
# .T transposes the frame: rows become columns
df['item-price'].describe().to_frame().T 

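|            |   count |    mean |      std |   min |   25% |   50% |   75% |   max |
|:-----------|--------:|--------:|---------:|------:|------:|------:|------:|------:|
| item-price |   482.0 | 4.84668 | 7.750208 |   0.5 |   1.6 |   2.2 |  3.15 |  45.0 |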


Let's say I'm interested in understanding the price of 🍞 bread products:

## Q2: How many bread products are there in the dataset?

In [7]:
all_bread = df['name'].str.contains('bread', case=False)

print(f"There are {all_bread.sum()} bread products in the dataset.")

There are 52 bread products in the dataset.


## Q3: Are they all truly bread? Or do I have some other products with the string `'bread'` in the name?

In [8]:
# Follow my live demo to understand the process of writing the code below.
df[all_bread][['name', 'size', 'item-price', 'page']].sort_values(['name', 'size']).set_index(['page']).head(5)

| page | name | size | item-price |
|:-----|:-----|:-----|-----------:|
| https://www.waitrose.com/ecom/products/all-butter-shortbread/056597-28405-28406 | All Butter Shortbread | each | 1.2 |
| https://www.waitrose.com/ecom/products/bfree-gluten-free-wholegrain-pitta-breads/657631-695118-695119 | BFree Gluten Free Wholegrain Pitta Breads | 4x55g | 2.9 |
| https://www.waitrose.com/ecom/products/bacheldr-rustic-crunch-bread-mix/559129-220498-220499 | Bacheldr Rustic Crunch Bread Mix | 500g | 1.5 |
| https://www.waitrose.com/ecom/products/cohens-bakery-buckingham-rye-bread/077133-39207-39208 | Cohens Bakery Buckingham Rye Bread | 400g | 2.1 |
| https://www.waitrose.com/ecom/products/crosta-mollica-piadina-flatbreads/817933-198092-198093 | Crosta & Mollica Piadina Flatbreads | 3s | 2.0 |


🧑‍⚖️ **DECISION:** 

- Remove 'shortbread'
- Remove 'flatbread'
- Remove 'bread mix'
- Remove 'pitta bread'
- Remove 'gingerbread'

In [9]:
breads_to_remove = ['shortbread', 'flatbread', 'bread mix', 'pitta bread', 'gingerbread']

# Rebuild all_bread to exclude the breads_to_remove
all_bread = df['name'].str.contains('bread', case=False) & ~df['name'].str.contains('|'.join(breads_to_remove), case=False)


print(f"There are {all_bread.sum()} bread products in the dataset.")

There are 31 bread products in the dataset.
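The filtering above combines two boolean masks: one for names that contain `'bread'`, and one for names that match any of the excluded terms (joined with `|`, which `str.contains` treats as a regular-expression "or"). A minimal sketch on invented product names:

```python
import pandas as pd

# Hypothetical product names, for illustration only
names = pd.Series([
    'Seeded Sourdough Bread',
    'All Butter Shortbread',
    'Rustic Crunch Bread Mix',
    'Wholegrain Pitta Breads',
    'White Medium Sliced Bread',
])
exclude = ['shortbread', 'bread mix', 'pitta bread']

is_bread = names.str.contains('bread', case=False)
# 'shortbread|bread mix|pitta bread' matches a name containing ANY of the terms
is_excluded = names.str.contains('|'.join(exclude), case=False)

keep = is_bread & ~is_excluded
print(names[keep].tolist())
```

Because the joined string is a regular expression, terms containing special characters such as `.` or `(` would need escaping with `re.escape` first.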


In [10]:
df_bread = df[all_bread].sort_values(['name', 'size'])

df_bread[['name', 'size', 'item-price', 'page']].set_index(['page']).head(5)

| page | name | size | item-price |
|:-----|:-----|:-----|-----------:|
| https://www.waitrose.com/ecom/products/cohens-bakery-buckingham-rye-bread/077133-39207-39208 | Cohens Bakery Buckingham Rye Bread | 400g | 2.1 |
| https://www.waitrose.com/ecom/products/essential-white-medium-sliced-bread/055018-27631-27632 | Essential White Medium Sliced Bread | 800g | 0.75 |
| https://www.waitrose.com/ecom/products/essential-wholemeal-medium-sliced-bread/055051-27670-27671 | Essential Wholemeal Medium Sliced Bread | 800g | 0.75 |
| https://www.waitrose.com/ecom/products/hovis-1886-granary-sliced-bread/841477-746623-746624 | Hovis 1886 Granary Sliced Bread | 450g | 1.5 |
| https://www.waitrose.com/ecom/products/hovis-1886-seeded-sliced-bread/531774-746615-746616 | Hovis 1886 Seeded Sliced Bread | 450g | 1.5 |


## Q4: Which sizes are available for each bread product?

⭐️ GET READY FOR YOUR FIRST `groupby()`!

- Follow my live demo closely as I explain the difference between `pd.Series` and `pd.DataFrame`; that difference explains the output of the code below.
- You will also learn about the `apply()` method (not shown here yet)

In [11]:
# This is the simpler solution, but why does it look odd and different to the data we've been seeing in previous steps?
df_bread.groupby('name')['size'].unique().head(5)

name
Cohens Bakery Buckingham Rye Bread         [400g]
Essential White Medium Sliced Bread        [800g]
Essential Wholemeal Medium Sliced Bread    [800g]
Hovis 1886 Granary Sliced Bread            [450g]
Hovis 1886 Seeded Sliced Bread             [450g]
Name: size, dtype: object
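The output looks different because selecting a single column after `groupby()` and aggregating it returns a `pd.Series` indexed by the group key, not a `pd.DataFrame`. A minimal sketch on made-up data shows the two types side by side; calling `reset_index()` turns the Series back into a DataFrame with `name` as a regular column.

```python
import pandas as pd

# Toy data (hypothetical products)
toy = pd.DataFrame({
    'name': ['Loaf A', 'Loaf A', 'Loaf B'],
    'size': ['400g', '800g', '800g'],
})

as_series = toy.groupby('name')['size'].unique()   # one array of sizes per name
as_frame = as_series.reset_index()                 # index 'name' becomes a column

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.columns.tolist())
```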

## Q5: How many sizes are available for each bread product?

What if I want a count, not the sizes themselves?

In [12]:
df_bread.groupby('name')['size'].nunique().head(5)

name
Cohens Bakery Buckingham Rye Bread         1
Essential White Medium Sliced Bread        1
Essential Wholemeal Medium Sliced Bread    1
Hovis 1886 Granary Sliced Bread            1
Hovis 1886 Seeded Sliced Bread             1
Name: size, dtype: int64

**🎯 ACTION POINTS:**

- Create a solution where you have two columns: `available_sizes` and `num_sizes`
- Sort the resulting DataFrame by `num_sizes` in descending order

<div style="margin-left:2em;padding-left:1em;font-size:0.75em;width:40%">

💡 **HINT:** To create a new column in pandas, use the following syntax:

```python
df['new_column'] = df['old_column'].apply(lambda x: x + 1)
```

</div>

In [13]:
# Descending order
availl = df_bread.groupby('name')['size'].unique().reset_index()
num_size = df_bread.groupby('name')['size'].nunique().reset_index()
availl['num_size']= num_size['size']
availl.sort_values(by ='num_size',ascending=False).head(5)

|    | name                                           | size         |   num_size |
|---:|:-----------------------------------------------|:-------------|-----------:|
| 12 | Hovis Wholemeal Medium Sliced Bread            | [400g, 800g] |          2 |
|  9 | Hovis Seed Sensations Multiseeded Sliced Bread | [400g, 800g] |          2 |
|  2 | Essential Wholemeal Medium Sliced Bread        | [800g]       |          1 |
|  1 | Essential White Medium Sliced Bread            | [800g]       |          1 |
|  0 | Cohens Bakery Buckingham Rye Bread             | [400g]       |          1 |


In [14]:
#lambda method 
dum= df_bread.groupby('name').apply(lambda x : pd.Series({
    'available_size':x['size'].unique(),
    'num_size':len(x['size'].unique())
})).reset_index()

dum.sort_values(by = "num_size",ascending=False).head(5)


|    | name                                           | available_size |   num_size |
|---:|:-----------------------------------------------|:---------------|-----------:|
| 12 | Hovis Wholemeal Medium Sliced Bread            | [400g, 800g]   |          2 |
|  9 | Hovis Seed Sensations Multiseeded Sliced Bread | [400g, 800g]   |          2 |
|  2 | Essential Wholemeal Medium Sliced Bread        | [800g]         |          1 |
|  1 | Essential White Medium Sliced Bread            | [800g]         |          1 |
|  0 | Cohens Bakery Buckingham Rye Bread             | [400g]         |          1 |
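A third option, offered here as an alternative sketch rather than the approach from the live demo, is pandas' named aggregation: a single `agg()` call can build both columns at once and name them as the action points ask (`available_sizes`, `num_sizes`). The toy data below is invented.

```python
import pandas as pd

# Toy data (hypothetical products)
toy = pd.DataFrame({
    'name': ['Loaf A', 'Loaf A', 'Loaf B', 'Loaf C'],
    'size': ['400g', '800g', '800g', '500g'],
})

sizes = (
    toy.groupby('name')
       .agg(
           # new_column=(source_column, aggregation)
           available_sizes=('size', lambda s: sorted(s.unique())),
           num_sizes=('size', 'nunique'),
       )
       .sort_values('num_sizes', ascending=False)
       .reset_index()
)
print(sizes)
```

The lambda returns a plain Python list for each group, so each cell of `available_sizes` holds the list of distinct sizes for that product.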


## Q6: How different are the prices of sliced vs unsliced breads?

**🎯 ACTION POINTS:**

Now let's do something more complex! 

<span style="display:block;margin-left:1.5em;font-size:0.85em;">If you manage to solve this, then you would have already built the skills to solve the 💻 [Week 01 Day 01 Lab](https://lse-dsi.github.io/ME204/2024/weeks/week01/day01/lab.html) - 🎁 [Bonus Tasks (Challenge)](https://lse-dsi.github.io/ME204/2024/weeks/week01/day01/lab.html#bonus-tasks)!</span>

<div style="display:inline-flex;flex-wrap:wrap;flex-direction:row;width:80%;margin-left:0.5em">

<div style="width:400px;height:260px;border-radius:1em;margin:1%;padding:1.5%;background-color:#fafafa">

<h2>Is it sliced?</h2>

Create a new column on the DataFrame of breads called `is_sliced` and fill it with `True` if the product is sliced and `False` otherwise.

</div>

<div style="width:400px;height:260px;border-radius:1em;margin:1%;padding:1.5%;background-color:#fafafa">

<h2>Price per kg</h2>

Create a new column on the DataFrame of breads called `price-per-kg`. 

Check if `price-per-unit` is suitable. If not, replace it with the price per 100g of the product.

</div>
<!-- 
<div style="width:400px;height:460px;border-radius:1em;margin:1%;padding:1.5%;background-color:#fafafa">

<h2>Variety</h2>

Create a new column on the DataFrame of breads called `variety` and fill it with the variety of the product. 

For example, the variety of 

> "Hovis Wholemeal Medium Sliced Bread" 

is "Wholemeal Medium Sliced Bread" 

And the variety of 

> "Irwin's Together Malted Grain Bread" 

is "Malted Grain Bread"

</div>

<div style="width:400px;height:460px;border-radius:1em;margin:1%;padding:1.5%;background-color:#fafafa">

<h2>Brand</h2>

Create a new column on the DataFrame of breads and call it `brand`.  Fill it with the brand of the product. 

For example, the brand of 

> Hovis Wholemeal Medium Sliced Bread

is "Hovis"

And the brand of 

> Irwin's Together Malted Grain Bread 

is "Irwin's Together"

</div> -->

</div>


**Then, compare the distribution of prices of sliced vs unsliced breads.**

<div style="color:transparent;background-color:transparent">

<details style="height:0.1em"><summary></summary>

Shhh 🤫, here is the solution:


(
    df_bread.assign(is_sliced=lambda x: x['name'].str.contains('sliced', case=False),
                    price_per_kg=lambda x: x['item-price'] / x['size'].str.replace('g', '').astype('float') * 1000)
            .rename(columns={'price_per_kg': 'price-per-kg'})
            .groupby('is_sliced')
            .apply(lambda x: x['price-per-kg'].describe(), include_groups=False)
)

</details>
</div>

In [None]:
# Match 'sliced' anywhere in the name, so 'Sliced White Bread' and 'Sliced Seeded Bread' count too
df_bread['is_sliced'] = (
    df_bread['name'].str.contains('sliced', case=False)
                    .map({True: 'Sliced', False: 'Unsliced'})
)
df_bread[['name','is_sliced','item-price']].head(5)

|     | name                                    | is_sliced   |   item-price |
|----:|:----------------------------------------|:------------|-------------:|
| 484 | Cohens Bakery Buckingham Rye Bread      | Unsliced    |         2.1  |
| 178 | Essential White Medium Sliced Bread     | Sliced      |         0.75 |
|  79 | Essential Wholemeal Medium Sliced Bread | Sliced      |         0.75 |
| 227 | Hovis 1886 Granary Sliced Bread         | Sliced      |         1.5  |
|  61 | Hovis 1886 Seeded Sliced Bread          | Sliced      |         1.5  |
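Once a label column like `is_sliced` exists, comparing the two groups is a one-line `groupby()` followed by `describe()`. The sketch below uses invented names and prices purely to show the shape of the result; it is not the bakery data.

```python
import pandas as pd

# Toy data with made-up prices per kg
toy = pd.DataFrame({
    'name': ['White Sliced Bread', 'Sliced Seeded Loaf', 'Rye Bread', 'Sourdough Bread'],
    'price-per-kg': [1.0, 2.0, 5.0, 7.0],
})

# Label each product, then summarise the price distribution per label
toy['is_sliced'] = (
    toy['name'].str.contains('sliced', case=False)
               .map({True: 'Sliced', False: 'Unsliced'})
)
summary = toy.groupby('is_sliced')['price-per-kg'].describe()
print(summary[['count', 'mean', '50%']])
```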


In [17]:
# Human-readable price label, e.g. 2.1 -> '£ 2.10'
df_bread['price']=df_bread['item-price'].apply(lambda x : f'£ {x:.2f}')

In [18]:
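def process(price):
    """Convert a `price-per-unit` string to pounds per kg.

    Assumes '/100g' prices are quoted in pence and '/kg' prices in pounds.
    """
    price_str = str(price).lower().strip()
    price_str = price_str.replace('£','').replace('p','')

    if '/100g' in price_str:
        # pence per 100g -> pounds per kg: x10 for the weight, /100 for the currency
        price_str = price_str.replace('/100g','').strip()
        return float(price_str)/10
    elif '/kg' in price_str:
        price_str = price_str.replace('/kg','').strip()
        return float(price_str)
    else:
        return float(price_str)

df_bread['price-per-kg'] = df_bread['price-per-unit'].apply(process)
# The table below needs an `offer_description` column; products without an offer get a placeholder
df_bread['offer_description'] = df_bread['offer-description'].fillna('not on offer')
df_bread[['name','price','size','price-per-kg','is_sliced','offer','offer_description']].set_index('name').head(5).reset_index()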

|    | name                                    | price   | size   |   price-per-kg | is_sliced   | offer   | offer_description   |
|---:|:----------------------------------------|:--------|:-------|---------------:|:------------|:--------|:--------------------|
|  0 | Cohens Bakery Buckingham Rye Bread      | £ 2.10  | 400g   |           5.25 | Unsliced    | False   | not on offer        |
|  1 | Essential White Medium Sliced Bread     | £ 0.75  | 800g   |           0.94 | Sliced      | False   | not on offer        |
|  2 | Essential Wholemeal Medium Sliced Bread | £ 0.75  | 800g   |           0.94 | Sliced      | False   | not on offer        |
|  3 | Hovis 1886 Granary Sliced Bread         | £ 1.50  | 450g   |           3.33 | Sliced      | True    | save 30p. Was £1.80 |
|  4 | Hovis 1886 Seeded Sliced Bread          | £ 1.50  | 450g   |           3.33 | Sliced      | True    | save 30p. Was £1.80 |
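The `process` function above strips the currency markers before it looks at the unit, so it relies on `/100g` prices being quoted in pence and `/kg` prices in pounds. A more defensive sketch treats currency and unit as two independent conversions. The function name `unit_price_to_pounds_per_kg` and the sample strings are hypothetical; the real column may contain formats this pattern does not cover, in which case it raises an error instead of guessing.

```python
import re

# Optional '£', a number, optional 'p', then '/100g' or '/kg'
UNIT_PRICE = re.compile(r'\s*(£)?\s*([\d.]+)\s*(p)?\s*/\s*(100g|kg)\s*')

def unit_price_to_pounds_per_kg(text: str) -> float:
    """Parse strings like '9.4p/100g' or '£5.25/kg' into pounds per kg."""
    match = UNIT_PRICE.fullmatch(text.lower())
    if match is None:
        raise ValueError(f'Unrecognised unit price: {text!r}')
    _, number, pence_sign, unit = match.groups()
    value = float(number)
    if pence_sign:          # pence -> pounds
        value /= 100
    if unit == '100g':      # per 100g -> per kg
        value *= 10
    return value

for sample in ['9.4p/100g', '£5.25/kg', '£1.20/100g', '93.8p/kg']:
    print(sample, '->', round(unit_price_to_pounds_per_kg(sample), 4))
```

Raising on unknown formats (for example a price quoted "each") makes gaps in the data visible early rather than producing a misleading number.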


## Want to practice some more EDA?

Why stop at bread? 🍞

Take another look at the original list of products under `df` and see if you can find other interesting questions to ask and answer. Try comparing the prices of different products, or the number of sizes available for each product. The sky is the limit! 

# ☕️ Time for a Coffee Break!

# 3. Visualizing the data 

We will use the [lets-plot](https://lets-plot.org/) library to create some visualizations. `lets-plot` is a Python implementation of the popular `ggplot2` library in R, the best-known implementation of the **Grammar of Graphics**.

To get the visualisation to work, I first need to rework the data a bit. To reproduce my result, you must find a way to create a DataFrame with exactly the same content as the one below.

<div style="width:80%;font-size:0.65em;margin-left:1em">

| name                                           | price   | size   |   price-per-kg | is_sliced   | offer   | offer_description   |
|:-----------------------------------------------|:--------|:-------|---------------:|:------------|:--------|:--------------------|
| Cohens Bakery Buckingham Rye Bread             | £ 2.10  | 400g   |        5.25    | Unsliced    | False   | Not On Offer        |
| Essential White Medium Sliced Bread            | £ 0.75  | 800g   |        0.9375  | Sliced      | False   | Not On Offer        |
| Essential Wholemeal Medium Sliced Bread        | £ 0.75  | 800g   |        0.9375  | Sliced      | False   | Not On Offer        |
| Hovis 1886 Granary Sliced Bread                | £ 1.50  | 450g   |        3.33333 | Sliced      | True    | save 30p. Was £1.80 |
| Hovis 1886 Seeded Sliced Bread                 | £ 1.50  | 450g   |        3.33333 | Sliced      | True    | save 30p. Was £1.80 |
| Hovis Best of Both Medium Sliced Bread         | £ 1.35  | 800g   |        1.6875  | Sliced      | False   | Not On Offer        |
| Hovis Granary Wholemeal Sliced Bread           | £ 1.90  | 800g   |        2.375   | Sliced      | False   | Not On Offer        |
| Hovis Original Granary Medium Sliced Bread     | £ 1.90  | 800g   |        2.375   | Sliced      | False   | Not On Offer        |
| Hovis Original Granary Thick Sliced Bread      | £ 1.25  | 400g   |        3.125   | Sliced      | False   | Not On Offer        |
| Hovis Seed Sensations Multiseeded Sliced Bread | £ 1.20  | 400g   |        3       | Sliced      | False   | Not On Offer        |
| Hovis Seed Sensations Multiseeded Sliced Bread | £ 1.90  | 800g   |        2.375   | Sliced      | False   | Not On Offer        |
| Hovis Soft White Medium Sliced White Bread     | £ 1.45  | 800g   |        1.8125  | Sliced      | False   | Not On Offer        |
| Hovis Soft White Thick Sliced White Bread      | £ 1.45  | 800g   |        1.8125  | Sliced      | False   | Not On Offer        |
| Hovis Wholemeal Medium Sliced Bread            | £ 1.10  | 400g   |        2.75    | Sliced      | False   | Not On Offer        |
| Hovis Wholemeal Medium Sliced Bread            | £ 1.45  | 800g   |        1.8125  | Sliced      | False   | Not On Offer        |
| Hovis Wholemeal Thick Sliced Bread             | £ 1.45  | 800g   |        1.8125  | Sliced      | False   | Not On Offer        |
| Irwin's Together Brown Soda Bread              | £ 2.00  | 400g   |        5       | Unsliced    | False   | Not On Offer        |
| Livlife Seriously Seeded Sliced Bread          | £ 2.00  | 500g   |        4       | Sliced      | False   | Not On Offer        |
| No.1 Malt Sourdough Bread with Seeds           | £ 2.20  | 500g   |        4.4     | Unsliced    | False   | Not On Offer        |
| No.1 Rye and Wheat Dark Sourdough Bread        | £ 2.70  | 500g   |        5.4     | Unsliced    | False   | Not On Offer        |
| No.1 Spelt Sourdough Bread                     | £ 2.70  | 500g   |        5.4     | Unsliced    | False   | Not On Offer        |
| No.1 White Sourdough Bread                     | £ 2.20  | 500g   |        4.4     | Unsliced    | False   | Not On Offer        |
| Schneider Brot Rye Bread with Sunflower Seeds  | £ 1.50  | 500g   |        3       | Unsliced    | False   | Not On Offer        |
| Seeded Sourdough Bread                         | £ 2.25  | 500g   |        4.5     | Unsliced    | False   | Not On Offer        |
| The Heart of Nature Pure Grain Bread           | £ 3.65  | 500g   |        7.3     | Unsliced    | False   | Not On Offer        |
| Vogel's Original Mixed Grain Bread             | £ 2.40  | 800g   |        3       | Unsliced    | False   | Not On Offer        |
| Vogel's Soya & Linseed Bread                   | £ 2.40  | 800g   |        3       | Unsliced    | False   | Not On Offer        |
| Wildfarmed Seeded Sourdough Bread              | £ 4.00  | 600g   |        6.66667 | Unsliced    | False   | Not On Offer        |
| Wildfarmed Sliced Seeded Bread                 | £ 2.80  | 550g   |        5.09091 | Sliced      | False   | Not On Offer        |
| Wildfarmed Sliced White Bread                  | £ 2.80  | 550g   |        5.09091 | Sliced      | False   | Not On Offer        |
| Wildfarmed White Sourdough Bread               | £ 4.00  | 600g   |        6.66667 | Unsliced    | False   | Not On Offer        |

</div>

Here's what you get when you run the `info()` method on this DataFrame:

<div style="width:30%;font-size:0.75em;margin-left:1em">

```python
<class 'pandas.core.frame.DataFrame'>
Index: 31 entries, 484 to 244
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   name               31 non-null     object 
 1   price              31 non-null     object 
 2   size               31 non-null     object 
 3   price-per-kg       31 non-null     float64
 4   is_sliced          31 non-null     object 
 5   offer              31 non-null     bool   
 6   offer_description  31 non-null     object 
dtypes: bool(1), float64(1), object(5)
memory usage: 1.7+ KB
```

</div>

💡 TIP: **Pay close attention as I explain the concept and the code below!**

If we don't have time to create the dataframe above together, I'll share the solution later.

<div style="color:transparent;background-color:transparent">

<details style="height:0.1em"><summary></summary>
plot_df = (
    df_bread.assign(is_sliced=lambda x: x['name'].str.contains('sliced', case=False),
                    price_per_kg=lambda x: x['item-price'] / x['size'].str.replace('g', '').astype('float') * 1000,
                    price=lambda x: x['item-price'].apply(lambda x: f"£ {x:.2f}"),
                    offer_description=lambda x: x['offer-description'].apply(lambda d: d if type(d) == str else 'Not On Offer'))
            .assign(is_sliced=lambda x: x['is_sliced'].map({True: 'Sliced', False: 'Unsliced'}))
            .drop(columns=['id', 'item-price', 'image-url', 'page', 'price-per-unit', 'offer-description'])
            .rename(columns={'price_per_kg': 'price-per-kg'})
            [['name', 'price', 'size', 'price-per-kg', 'is_sliced', 'offer', 'offer_description']]
)

print(plot_df.to_markdown(index=False))
</details>
</div>

In [19]:
plot_df = df_bread
# plot_df.set_index('name').head(5)

In [20]:
(
    ggplot(data=plot_df, 
           mapping=aes(x='is_sliced', y='price-per-kg', fill='is_sliced')) +
    geom_jitter(width=0.15, height=0, alpha=0.75, size=5, stroke=1.2, color="black", shape=21,
                tooltips=layer_tooltips().line('@name').line('@size').line('@price').line('@offer_description')) +
    geom_boxplot(width=0.35, alpha=0.35, color='black') +
    scale_x_discrete(name='') +
    scale_y_continuous(name='Price per kg (£)', breaks=list(range(0, 10)), limits=[0, 8], format='£ {.2f}') +
    labs(title='Sliced bread is consistently cheaper!', 
         subtitle='A comparison of the price per kg of sliced and unsliced bread',
         caption='Hover your mouse over the points to see the details') +

     theme(axis_text_x=element_text(size=17),
           axis_text_y=element_text(size=17),
           axis_title_x=element_text(size=20),
           axis_title_y=element_text(size=20),
           plot_title=element_text(size=22, face='bold'),
           plot_subtitle=element_text(size=18),
           legend_position='none') +
     ggsize(700, 400)
)
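The box in a box plot is drawn from each group's quartiles: the lower edge is the 25th percentile, the middle line the median, and the upper edge the 75th percentile. As a rough cross-check of what the `geom_boxplot()` layer summarises, the same numbers can be computed directly in pandas. The prices below are invented, and the plotting library's exact quantile method may differ slightly from pandas' default.

```python
import pandas as pd

# Toy data with made-up prices per kg
toy = pd.DataFrame({
    'is_sliced': ['Sliced'] * 5 + ['Unsliced'] * 5,
    'price-per-kg': [1.0, 2.0, 3.0, 4.0, 5.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})

# One row per group, one column per quartile
quartiles = (
    toy.groupby('is_sliced')['price-per-kg']
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(quartiles)
```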

# Afternoon session

In [21]:
gapminder = pd.read_csv('https://raw.githubusercontent.com/kirenz/datasets/master/gapminder.csv')


# Convert the year column to a datetime object

gapminder['year'] = pd.to_datetime(gapminder['year'], format='%Y')

In [22]:
gapminder.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   country    1704 non-null   object        
 1   continent  1704 non-null   object        
 2   year       1704 non-null   datetime64[ns]
 3   lifeExp    1704 non-null   float64       
 4   pop        1704 non-null   int64         
 5   gdpPercap  1704 non-null   float64       
dtypes: datetime64[ns](1), float64(2), int64(1), object(2)
memory usage: 80.0+ KB
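With `format='%Y'`, each year is parsed as a timestamp at 1 January of that year, which is why `year` now shows up as `datetime64[ns]`. A minimal sketch of the round trip on two sample years, using the `.dt.year` accessor to get the integer back:

```python
import pandas as pd

years = pd.Series([1952, 2007])

# Integer years -> timestamps at the start of each year
as_dates = pd.to_datetime(years, format='%Y')
print(as_dates.tolist())

# Timestamps -> integer years again
print(as_dates.dt.year.tolist())
```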


In [23]:
(
    ggplot(data = gapminder,mapping=aes(
        x = 'gdpPercap',
        y='lifeExp',
        color= 'continent',
        size = 'pop'
    ))+geom_point(alpha =0.5)
)

In [24]:
(
    ggplot(data = gapminder,mapping=aes(
        x = 'gdpPercap',
        y='lifeExp',
        color= 'continent',
        size = 'pop'
    ))+geom_point(alpha =0.5)+ggtitle("gdpPercap vs lifeExp")
)

In [25]:
(
    ggplot(data = gapminder,mapping=aes(
        x = 'gdpPercap',
        y='lifeExp',
        color= 'continent',
        size = 'pop'
    ))+geom_point(alpha =0.5)+ggtitle("gdpPercap vs lifeExp")+
    scale_x_log10()
)

In [26]:
(
    ggplot(data = gapminder,mapping=aes(
        x = 'gdpPercap',
        y='lifeExp',
        color= 'continent',
        size = 'pop'
    ))+geom_point(alpha =0.5)+ggtitle("gdpPercap vs lifeExp")+
    scale_x_log10(
        breaks=[1000,10000,25000,50000,100000]
    )
)
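On a log scale, the position of a value depends on its base-10 logarithm, so each tenfold increase in GDP per capita takes up the same distance on the axis. A small NumPy sketch, using the same break values as the plot, makes the spacing visible:

```python
import numpy as np

breaks = np.array([1000, 10000, 25000, 50000, 100000])

# Axis positions on a log10 scale
positions = np.log10(breaks)
print(positions.round(3))

# Powers of ten are evenly spaced: 1,000 -> 10,000 -> 100,000
powers_of_ten = np.log10(np.array([1000, 10000, 100000]))
print(np.diff(powers_of_ten))
```

The intermediate breaks (25,000 and 50,000) fall between the positions of 10,000 and 100,000, closer to the upper end than they would on a linear axis.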

In [27]:
(
    ggplot(data = gapminder,mapping=aes(
        x = 'gdpPercap',
        y='lifeExp',
        color= 'continent',
        size = 'pop'
    ))+geom_point(alpha =0.5)+ggtitle("gdpPercap vs lifeExp")+
    scale_x_log10(
        breaks=[1000,10000,25000,50000,100000],
        labels = [f'${nz.numerize(x)}' for x in[1000,10000,25000,50000,100000]]
    )
)

In [28]:
(
    ggplot(data = gapminder,mapping=aes(
        x = 'gdpPercap',
        y='lifeExp',
        color= 'continent',
        size = 'pop'
    ))+geom_point(alpha =0.5)+ggtitle("gdpPercap vs lifeExp")+
    scale_x_log10(
        breaks=[1000,10000,25000,50000,100000],
        labels = [f'${nz.numerize(x)}' for x in[1000,10000,25000,50000,100000]]
    )+
    theme_bw()
)

In [29]:
(
    ggplot(data = gapminder,mapping=aes(
        x = 'gdpPercap',
        y='lifeExp',
        color= 'continent',
        size = 'pop'
    ))+geom_point(alpha =0.5)+ggtitle("gdpPercap vs lifeExp")+
    scale_x_log10(
        breaks=[1000,10000,25000,50000,100000],
        labels = [f'${nz.numerize(x)}' for x in[1000,10000,25000,50000,100000]]
    )+
    theme_bw()+
    labs(
        x = "GDP per Capita",
        y = "Life EXP",
        color= "Continent",
        size= "Population"
    )
)

# PIVOT TABLE

In [30]:
gap_pivot_lifeExp = gapminder.pivot_table(
    index='year',columns='continent',
    values='lifeExp',aggfunc="mean"
)
gap_pivot_lifeExp.head()

| year       |    Africa |   Americas |      Asia |    Europe |   Oceania |
|:-----------|----------:|-----------:|----------:|----------:|----------:|
| 1952-01-01 | 39.1355   |   53.27984 | 46.314394 | 64.4085   |    69.255 |
| 1957-01-01 | 41.266346 |   55.96028 | 49.318544 | 66.703067 |    70.295 |
| 1962-01-01 | 43.319442 |   58.39876 | 51.563223 | 68.539233 |    71.085 |
| 1967-01-01 | 45.334538 |   60.41092 | 54.66364  | 69.7376   |    71.31  |
| 1972-01-01 | 47.450942 |   62.39492 | 57.319269 | 70.775033 |    71.91  |
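To see what `pivot_table()` does on a scale small enough to check by hand, here is a sketch with six invented rows: the `index` values become the rows, the `columns` values become the column headers, and every group of matching rows is collapsed with `aggfunc`.

```python
import pandas as pd

# Toy long-format data (made-up life expectancies)
toy = pd.DataFrame({
    'year': [1952, 1952, 1952, 2007, 2007, 2007],
    'continent': ['Africa', 'Africa', 'Europe', 'Africa', 'Africa', 'Europe'],
    'lifeExp': [40.0, 42.0, 65.0, 52.0, 56.0, 78.0],
})

# One row per year, one column per continent, mean lifeExp in each cell
wide = toy.pivot_table(index='year', columns='continent',
                       values='lifeExp', aggfunc='mean')
print(wide)
```

The two African rows for 1952 (40 and 42) are averaged into a single cell, which is the same thing that happens to all the countries of a continent in the gapminder table above.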


In [31]:
gap_pivot_gdppercap=gapminder.pivot_table(
    index="year",columns='country',values='gdpPercap',aggfunc='mean'
)
gap_pivot_gdppercap.head()

| year | Afghanistan | Albania | Algeria | Angola | Argentina | Australia | Austria | Bahrain | Bangladesh | Belgium | ... | Uganda | United Kingdom | United States | Uruguay | Venezuela | Vietnam | West Bank and Gaza | Yemen, Rep. | Zambia | Zimbabwe |
|:-----|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|:---:|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|
| 1952-01-01 | 779.445314 | 1601.056136 | 2449.008185 | 3520.610273 | 5911.315053 | 10039.59564 | 6137.076492 | 9867.084765 | 684.244172 | 8343.105127 | ... | 734.753484 | 9979.508487 | 13990.48208 | 5716.766744 | 7689.799761 | 605.066492 | 1515.592329 | 781.717576 | 1147.388831 | 406.884115 |
| 1957-01-01 | 820.85303 | 1942.284244 | 3013.976023 | 3827.940465 | 6856.856212 | 10949.64959 | 8842.59803 | 11635.79945 | 661.637458 | 9714.960623 | ... | 774.371069 | 11283.17795 | 14847.12712 | 6150.772969 | 9802.466526 | 676.285448 | 1827.067742 | 804.830455 | 1311.956766 | 518.764268 |
| 1962-01-01 | 853.10071 | 2312.888958 | 2550.81688 | 4269.276742 | 7133.166023 | 12217.22686 | 10750.72111 | 12753.27514 | 686.341554 | 10991.20676 | ... | 767.27174 | 12477.17707 | 16173.14586 | 5603.357717 | 8422.974165 | 772.04916 | 2198.956312 | 825.623201 | 1452.725766 | 527.272182 |
| 1967-01-01 | 836.197138 | 2760.196931 | 3246.991771 | 5522.776375 | 8052.953021 | 14526.12465 | 12834.6024 | 14804.6727 | 721.186086 | 13149.04119 | ... | 908.918522 | 14142.85089 | 19530.36557 | 5444.61962 | 9541.474188 | 637.123289 | 2649.715007 | 862.442146 | 1777.077318 | 569.795071 |
| 1972-01-01 | 739.981106 | 3313.422188 | 4182.663766 | 5473.288005 | 9443.038526 | 16788.62948 | 16661.6256 | 18268.65839 | 630.233627 | 16672.14356 | ... | 950.735869 | 15895.11641 | 21806.03594 | 5703.408898 | 10505.25966 | 699.501644 | 3133.409277 | 1265.047031 | 1773.498265 | 799.362176 |


In [32]:
#LONG FORMAT 

gap_melt = gap_pivot_lifeExp.melt(ignore_index=False).reset_index()
gap_melt.head()

|    | year       | continent   |     value |
|---:|:-----------|:------------|----------:|
|  0 | 1952-01-01 | Africa      | 39.1355   |
|  1 | 1957-01-01 | Africa      | 41.266346 |
|  2 | 1962-01-01 | Africa      | 43.319442 |
|  3 | 1967-01-01 | Africa      | 45.334538 |
|  4 | 1972-01-01 | Africa      | 47.450942 |


In [33]:
gapminder_melt_life_exp = gap_pivot_lifeExp.reset_index() \
    .melt(id_vars='year', 
          var_name='continent', 
          value_name='lifeExp')

gapminder_melt_life_exp.head()

|    | year       | continent   |   lifeExp |
|---:|:-----------|:------------|----------:|
|  0 | 1952-01-01 | Africa      | 39.1355   |
|  1 | 1957-01-01 | Africa      | 41.266346 |
|  2 | 1962-01-01 | Africa      | 43.319442 |
|  3 | 1967-01-01 | Africa      | 45.334538 |
|  4 | 1972-01-01 | Africa      | 47.450942 |
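`melt()` is the inverse of pivoting: every cell of the wide table becomes its own row, identified by the `id_vars` column plus the former column header. The sketch below, on a hand-made 2x2 wide table with invented values, melts it to long format and then pivots it back to confirm nothing is lost.

```python
import pandas as pd

# Toy wide table (made-up values): rows are years, columns are continents
wide = pd.DataFrame(
    {'Africa': [41.0, 54.0], 'Europe': [65.0, 78.0]},
    index=pd.Index([1952, 2007], name='year'),
)

# Wide -> long: one row per (year, continent) pair
long_df = wide.reset_index().melt(id_vars='year',
                                  var_name='continent',
                                  value_name='lifeExp')
print(long_df)

# Long -> wide again
round_trip = long_df.pivot(index='year', columns='continent', values='lifeExp')
print(round_trip)
```

Long format is what plotting libraries built on the Grammar of Graphics generally expect, because each aesthetic (`x`, `y`, `color`) maps to a single column.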


In [34]:
gap_melt_gdpPercap = gap_pivot_gdppercap.reset_index() \
    .melt(id_vars='year',
          var_name='Country',
          value_name='gdpPercapita')
gap_melt_gdpPercap.head()

|    | year       | Country     |   gdpPercapita |
|---:|:-----------|:------------|---------------:|
|  0 | 1952-01-01 | Afghanistan |     779.445314 |
|  1 | 1957-01-01 | Afghanistan |     820.85303  |
|  2 | 1962-01-01 | Afghanistan |     853.10071  |
|  3 | 1967-01-01 | Afghanistan |     836.197138 |
|  4 | 1972-01-01 | Afghanistan |     739.981106 |


# 4. A few more examples: faceting

In [35]:
import statsmodels.api as sm

mtcars = sm.datasets.get_rdataset('mtcars','datasets',cache=True).data

mtcars= pd.DataFrame(mtcars)

mtcars.head()

| rownames          |   mpg |   cyl |   disp |   hp |   drat |    wt |   qsec |   vs |   am |   gear |   carb |
|:------------------|------:|------:|-------:|-----:|-------:|------:|-------:|-----:|-----:|-------:|-------:|
| Mazda RX4         |  21.0 |     6 |  160.0 |  110 |   3.9  | 2.62  |  16.46 |    0 |    1 |      4 |      4 |
| Mazda RX4 Wag     |  21.0 |     6 |  160.0 |  110 |   3.9  | 2.875 |  17.02 |    0 |    1 |      4 |      4 |
| Datsun 710        |  22.8 |     4 |  108.0 |   93 |   3.85 | 2.32  |  18.61 |    1 |    1 |      4 |      1 |
| Hornet 4 Drive    |  21.4 |     6 |  258.0 |  110 |   3.08 | 3.215 |  19.44 |    1 |    0 |      3 |      1 |
| Hornet Sportabout |  18.7 |     8 |  360.0 |  175 |   3.15 | 3.44  |  17.02 |    0 |    0 |      3 |      2 |


In [36]:
(
    ggplot(mtcars,aes('wt','mpg',color='hp'))+
    geom_point()+
    labs(color="Horse Power")+
    geom_smooth(method='lm')+
    theme_minimal()
)

In [37]:
(
    ggplot(
        data = mtcars, 
        mapping = aes(
            x = 'wt', 
            y = 'mpg', 
            color = 'cyl'
            )
        ) +
        geom_point() +
        geom_smooth(method = 'lm') +
        theme_minimal()
)

In [38]:
(
  ggplot(data = mtcars,
       mapping = aes(x = 'disp',
                     y = 'mpg',
                     color = 'gear')) +
    geom_point() +
    geom_smooth(method = "lm") +
    theme_minimal()+
    scale_color_viridis() 
)

In [39]:
(
    ggplot(data = mtcars,
       mapping = aes(x = 'disp',
                     y = 'mpg',
                     color = 'gear')) +
        geom_point() +
        geom_smooth(method = "lm") +
        scale_color_viridis() +
        theme_bw()+
        facet_wrap('gear', ncol = 2) #<<
)
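Faceting splits the data by a variable and draws the same layers once per subset, so with `facet_wrap('gear')` each panel gets its own points and its own fitted line. As a rough pandas/NumPy analogue (a sketch of the idea, not of how `lets-plot` implements it), the loop below fits a separate least-squares line per `gear` group on invented numbers.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values), two 'panels': gear 3 and gear 4
toy = pd.DataFrame({
    'gear': [3, 3, 3, 4, 4, 4],
    'disp': [100.0, 200.0, 300.0, 100.0, 200.0, 300.0],
    'mpg':  [30.0, 20.0, 10.0, 35.0, 30.0, 25.0],
})

fits = {}
for gear, panel in toy.groupby('gear'):
    # Degree-1 polynomial fit = straight line: mpg = slope * disp + intercept
    slope, intercept = np.polyfit(panel['disp'], panel['mpg'], deg=1)
    fits[int(gear)] = (slope, intercept)
    print(f'{gear} gears: mpg = {slope:.3f} * disp + {intercept:.1f}')
```

Each group ends up with its own slope, which is the kind of difference that is easier to see when the panels sit side by side.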


In [40]:
(
     ggplot(data = mtcars,
          mapping = aes(x = 'disp',
                         y = 'mpg',
                         color = 'hp')) +
          geom_point() +
          geom_smooth(method = "lm") +
          scale_color_viridis() +
          facet_wrap('gear', ncol = 2, 
                      format = '{} gears') +
          labs(x = "Displacement", y = "Highway MPG",  #<<
               color = "Horsepower",   #<<
               title = "Heavier cars get lower mileage",  #<<
               subtitle = "Displacement indicates weight(?)",  #<<
               caption = "I know nothing about cars")+
               theme_bw()
          
)

In [41]:
(
    ggplot(data = mtcars,
          mapping = aes(x = 'disp',
                     y = 'mpg',
                     color = 'hp')
     ) +
     geom_point() +
     geom_smooth(method = "lm") +
     scale_color_viridis(breaks = [100, 200, 300],option='twilight',direction=-1) +
     facet_wrap('gear', ncol = 3, format = '{} gears') +
     labs(x = "Displacement", y = "Highway MPG",  #<<
          color = "Horsepower",   #<<
          title = "Heavier cars get lower mileage",  #<<
          subtitle = "Displacement indicates weight(?)",  #<<
          caption = "I know nothing about cars") +
     theme_bw() + 
     theme(panel_grid_major=element_blank(),legend_position = "left", #<<
        plot_title = element_text(face = "bold",color='blue')) #<<
) 

In [42]:
# get 2007 and 1952 data
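# .copy() makes an independent DataFrame, which avoids pandas' SettingWithCopyWarning below
_gapminder = gapminder[gapminder.year.isin([pd.to_datetime('2007', format='%Y'), pd.to_datetime('1952', format='%Y')])].copy()
_gapminder['year'] = _gapminder['year'].dt.year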



(
    ggplot(data = _gapminder, mapping = aes(
                    x = 'gdpPercap', 
                    y = 'lifeExp', 
                    color = 'continent', 
                    size = 'pop')
             ) +
        geom_point(alpha=0.5) +
        facet_wrap('year',ncol=1)+
        scale_x_log10(
            breaks = [1000, 10000, 25000, 50000, 100000],
            labels = [f'${nz.numerize(x)}' for x in [1000, 10000, 25000, 50000, 100000]]
            ) +
        theme_minimal() +
        labs(
            x = 'GDP per Capita',
            y = 'Life Expectancy',
            color = 'Continent',
            size = 'Population' 
        )
        
)



In [43]:
# get 2007 and 1952 data
_filt = [pd.to_datetime(x, format='%Y') for x in ['2007', '1952']]
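# .copy() makes an independent DataFrame, which avoids pandas' SettingWithCopyWarning below
_gapminder = gapminder[gapminder.year.isin(_filt)].copy()
_gapminder['year'] = _gapminder['year'].dt.year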

(
    ggplot(data = _gapminder, mapping = aes(
                    x = 'gdpPercap', 
                    y = 'lifeExp', 
                    color = 'continent', 
                    size = 'pop')
                    ) +
            geom_point(alpha=0.5) +
            scale_x_log10(
                breaks = [1000, 10000, 25000, 50000, 100000],
                labels = [f'${nz.numerize(x)}' for x in [1000, 10000, 25000, 50000, 100000]]
                ) +
            facet_wrap('year', ncol=1) + ## <<<<
            theme_minimal() +
            labs(
                x = 'GDP per Capita',
                y = 'Life Expectancy',
                color = 'Continent',
                size = 'Population' 
        )
        
)



In [44]:
(
    ggplot(data = gapminder, 
       mapping = aes(x = 'continent',
                     y = 'lifeExp',
                     fill = 'continent')
       ) +
        geom_violin(alpha=0.5) +
        geom_boxplot(alpha = 0.5) +
        guides(fill = 'none')  +# Turn off legend
        labs(
            title = 'Life Expectancy by Continent',
            y = 'Life Expectancy',
            # x-axis label
            x = 'Continent'
        ) +
        theme(panel_grid_major=element_blank(),plot_title=element_text(face='bold',color='green',size=20))
)

In [45]:
import plotly.express as px

In [46]:
px.scatter(gapminder,
           x='gdpPercap',
           y='lifeExp',
           animation_frame="year",
           animation_group="country",
           size="pop",color="continent",
           hover_name='country',
           log_x=True,
           size_max=100,
           range_x=[100,100000],  # upper limit matches the largest break used in the earlier plots
           range_y=[25,90]
           )