# Python for Data Science
## Session 6
### Datasets – Pandas I 

---

## Outline

1. Pandas

2. Loading and exploring datasets 

3. Data cleaning and preprocessing with Pandas 

---

## Pandas I

**Pandas** is one of the most used libraries within the Data Science community. It provides a full set of tools to work with:
- 2D data via DataFrame class
    - SQL/Spreadsheet-like datasets (tabular data)
    - Arbitrary matrix data with row and column labels
    - Any type of dataset with observational / statistical data (no labels needed)
- 1D data via Series class
    - Time series data
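
As a minimal sketch of the two classes listed above (the variable names and values here are illustrative assumptions, not part of the course material), a Series holds 1D labelled data and a DataFrame holds 2D labelled data:

```python
import pandas as pd

# 1D: a Series is a sequence of values with an index (here, dates)
temperatures = pd.Series(
    [21.5, 23.0, 19.8],
    index=pd.to_datetime(["2024-06-01", "2024-06-02", "2024-06-03"]),
    name="temperature",
)

# 2D: a DataFrame is a table with row labels (index) and column labels
matrix = pd.DataFrame(
    [[1, 2], [3, 4]],
    index=["row_a", "row_b"],
    columns=["col_x", "col_y"],
)

print(temperatures.ndim, matrix.ndim)   # number of dimensions of each object
print(type(matrix["col_x"]).__name__)   # each column of a DataFrame is a Series
```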


## Pandas I

As the dataset types above suggest, **Pandas** covers most of the use cases found in Data Science, e.g. finance, health, biology, supply chain, or meteorology. Things **Pandas** can do:

1. Handling missing data
2. Change DataFrame size, adding and removing columns and rows at will
3. Automatic data alignment of misaligned data
4. Group by operation
5. Data conversion
6. Advanced indexing
7. Data Merging and joining 
8. Reshaping data
9. Hierarchical indexing
10. Read/write support for CSV, Excel, databases, and fast HDF5 format.
11. Time series manipulation, frequency conversion, moving-window statistics
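
To make item 3 (automatic data alignment) concrete, here is a small sketch with made-up numbers. It assumes two Series whose labels only partly overlap; the names `sales_q1` and `sales_q2` are hypothetical:

```python
import pandas as pd

sales_q1 = pd.Series({"FR": 10, "ES": 20, "US": 30})
sales_q2 = pd.Series({"US": 5, "FR": 1, "IN": 7})

# Addition matches values by label, not by position;
# labels present in only one Series produce NaN
total = sales_q1 + sales_q2

# With fill_value, a label missing on one side is treated as 0 instead
total_filled = sales_q1.add(sales_q2, fill_value=0)

print(total)
print(total_filled)
```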


## Pandas I

Let's create a few DataFrames from different data structures and see how to start navigating them.


In [1]:
## Let's create a simple DataFrame to work with
import pandas as pd
dataset = [
    {
        "name": "Amelie",
        "age": 35
    },
    {
        "name": "Edgar",
        "age": 32
    }
]
df = pd.DataFrame(dataset)
df

Unnamed: 0,name,age
0,Amelie,35
1,Edgar,32


In [2]:
# Using a slightly different data structure
data = {'Name': ['Amelie', 'Edgar', 'Carlos', 'Victor'],
        'Age': [24, 27, 22, 32],
        'Country': ['FR', 'FR', 'ES', 'GE']}
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,Country
0,Amelie,24,FR
1,Edgar,27,FR
2,Carlos,22,ES
3,Victor,32,GE


In [3]:
# We can quickly list all the columns a dataframe contains through its columns attribute
df.columns

Index(['Name', 'Age', 'Country'], dtype='object')

In [4]:
# Another important thing we can do is to set up any specific index
# these act as row labels
df = pd.DataFrame(data, index=['A', 'B', 'C', 'D'])
df

Unnamed: 0,Name,Age,Country
A,Amelie,24,FR
B,Edgar,27,FR
C,Carlos,22,ES
D,Victor,32,GE


In [5]:
# Similar to lists and arrays, we can slice
df[:2]

Unnamed: 0,Name,Age,Country
A,Amelie,24,FR
B,Edgar,27,FR


In [6]:
# We can also access elements with loc, which selects by label
df.loc['B':'D'] # label slices include both endpoints ('B' and 'D')

Unnamed: 0,Name,Age,Country
B,Edgar,27,FR
C,Carlos,22,ES
D,Victor,32,GE


In [7]:
# iloc selects by integer position instead (end-exclusive, like list slicing)
df.iloc[1:]

Unnamed: 0,Name,Age,Country
B,Edgar,27,FR
C,Carlos,22,ES
D,Victor,32,GE


In [8]:
# We can also access a single value with at, passing the row label and the column label
df.at['A', 'Country']

'FR'

In [9]:
# With iloc we pass integer positions instead: row 0, column 1 is Amelie's age
df.iloc[0,1]

np.int64(24)
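
The difference between label-based and position-based access can be summarised in a short sketch. The `people` frame below is a hypothetical toy table, and `iat` (the positional counterpart of `at`) is added here for symmetry:

```python
import pandas as pd

people = pd.DataFrame(
    {"Name": ["Amelie", "Edgar", "Carlos"], "Age": [24, 27, 22]},
    index=["A", "B", "C"],
)

by_label = people.loc["A":"B"]     # label slice: both 'A' and 'B' are included
by_position = people.iloc[0:1]     # position slice: the end (1) is excluded

age_at = people.at["B", "Age"]     # single value, by row and column labels
age_iat = people.iat[1, 1]         # same value, by integer positions

print(len(by_label), len(by_position), age_at, age_iat)
```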

In [10]:
# Important: to modify a specific value we can use at, loc and iloc
df.loc['B', 'Age'] = 123
df

Unnamed: 0,Name,Age,Country
A,Amelie,24,FR
B,Edgar,123,FR
C,Carlos,22,ES
D,Victor,32,GE


In [11]:
# we can also check what elements satisfy some criteria by column (attribute)
df['Age'] > 25

A    False
B     True
C    False
D     True
Name: Age, dtype: bool

In [12]:
# and show them
df[df['Age'] > 25]

Unnamed: 0,Name,Age,Country
B,Edgar,123,FR
D,Victor,32,GE


In [13]:
# or modify a single attribute of the rows that satisfy a certain condition
df.loc[df['Age'] > 25, 'Name'] = 'Unknown' # with iloc the mask must be a plain array: df.iloc[(df['Age'] > 25).to_numpy(), 0] = 'Unknown'
df

Unnamed: 0,Name,Age,Country
A,Amelie,24,FR
B,Unknown,123,FR
C,Carlos,22,ES
D,Unknown,32,GE
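
Conditions like the one above can be combined. The following sketch, on a hypothetical `people` table, assumes you want rows matching several criteria at once:

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Amelie", "Edgar", "Carlos", "Victor"],
    "Age": [24, 27, 22, 32],
    "Country": ["FR", "FR", "ES", "GE"],
})

# Each condition needs its own parentheses; use & (and), | (or), ~ (not)
young_french = people[(people["Age"] < 30) & (people["Country"] == "FR")]

# isin tests membership in a list of values
es_or_ge = people[people["Country"].isin(["ES", "GE"])]

print(young_french)
print(es_or_ge)
```

Python's `and` / `or` keywords do not work element-wise on a Series, which is why the operators `&` and `|` are used.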


In [14]:
# we can also, similar to the way we do with dicts, add new columns
df['Residency'] = [True, False, True, False]

In [15]:
# or add new rows by assigning to index labels that do not exist yet
df.loc[6] = ['Jordi', 23, 'ES', False]
df.loc[9] = ['Anna', 19, 'ES', False]
df

Unnamed: 0,Name,Age,Country,Residency
A,Amelie,24,FR,True
B,Unknown,123,FR,False
C,Carlos,22,ES,True
D,Unknown,32,GE,False
6,Jordi,23,ES,False
9,Anna,19,ES,False


In [16]:
# head returns the first n rows
df.head(2)

Unnamed: 0,Name,Age,Country,Residency
A,Amelie,24,FR,True
B,Unknown,123,FR,False


In [17]:
# and tail returns the last n rows
df.tail(2)

Unnamed: 0,Name,Age,Country,Residency
6,Jordi,23,ES,False
9,Anna,19,ES,False


In [18]:
df.shape # to get the dataframe shape

(6, 4)

In [19]:
df.dtypes

Name         object
Age           int64
Country      object
Residency      bool
dtype: object

In [20]:
df['Age'].astype('float')

A     24.0
B    123.0
C     22.0
D     32.0
6     23.0
9     19.0
Name: Age, dtype: float64

In [21]:
# As in NumPy, there is a unique method
df['Country'].unique()

array(['FR', 'ES', 'GE'], dtype=object)
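
Two closely related methods are often useful alongside `unique`. This is a small sketch on a hypothetical `countries` Series:

```python
import pandas as pd

countries = pd.Series(["FR", "FR", "ES", "GE", "ES", "ES"], name="Country")

n_distinct = countries.nunique()     # number of distinct values
counts = countries.value_counts()    # occurrences per value, most frequent first

print(n_distinct)
print(counts)
```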

## Pandas I

Among the different methods two important ones are **info** and **describe**:

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, A to 9
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Name       6 non-null      object
 1   Age        6 non-null      int64 
 2   Country    6 non-null      object
 3   Residency  6 non-null      bool  
dtypes: bool(1), int64(1), object(2)
memory usage: 370.0+ bytes


In [23]:
df.describe()

Unnamed: 0,Age
count,6.0
mean,40.5
std,40.648493
min,19.0
25%,22.25
50%,23.5
75%,30.0
max,123.0
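
By default `describe` summarises only the numeric columns, which is why only `Age` appears above. As a sketch on a hypothetical `people` table, passing `include="all"` also reports count, number of unique values, most frequent value, and its frequency for text columns:

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Amelie", "Edgar", "Carlos", "Victor"],
    "Age": [24, 27, 22, 32],
    "Country": ["FR", "FR", "ES", "GE"],
})

# Statistics that do not apply to a column (e.g. mean of 'Country') show up as NaN
summary = people.describe(include="all")
print(summary)
```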


## Pandas I

And how do we know about missing elements and handle them?

In [24]:
# What about missing values? Missing values are usually represented as NaN (Not a Number)
import numpy as np
data = {'Name': ['Amelie', 'Edgar', 'Carlos', 'Victor'],
        'Age': [24, 27, 22, 32],
        'Country': ['FR', np.nan, 'ES', 'GE']}
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,Country
0,Amelie,24,FR
1,Edgar,27,
2,Carlos,22,ES
3,Victor,32,GE


In [25]:
# We can find out which entries are missing with isna (or its alias isnull)
df.isna() # at the whole dataframe level

Unnamed: 0,Name,Age,Country
0,False,False,False
1,False,False,True
2,False,False,False
3,False,False,False


In [26]:
# and at the column level
df['Country'].isnull() # same as .isna()

0    False
1     True
2    False
3    False
Name: Country, dtype: bool

In [27]:
# A simple way of filling the missing values is to call fillna
df.fillna('RE')

Unnamed: 0,Name,Age,Country
0,Amelie,24,FR
1,Edgar,27,RE
2,Carlos,22,ES
3,Victor,32,GE


In [28]:
# Another option is to fill with the mode (the most common value)
# Here every country appears once, so mode() returns all of them sorted and [0] picks 'ES'
#df['Country'].mode()
df.fillna(df['Country'].mode()[0])

Unnamed: 0,Name,Age,Country
0,Amelie,24,FR
1,Edgar,27,ES
2,Carlos,22,ES
3,Victor,32,GE


In [29]:
# IMPORTANT: if we check the dataframe again, we will see that it still contains NaNs
# This is because fillna returns a new DataFrame and leaves the original untouched
# To keep the result, assign it back (df = df.fillna('RE')) or pass inplace=True to fillna
df

Unnamed: 0,Name,Age,Country
0,Amelie,24,FR
1,Edgar,27,
2,Carlos,22,ES
3,Victor,32,GE
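
The behaviour described above can be checked with a short sketch. The `people` table is a hypothetical two-row example:

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({"Name": ["Amelie", "Edgar"], "Country": ["FR", np.nan]})

filled = people.fillna("Unknown")                 # new DataFrame; `people` is unchanged
still_missing = people["Country"].isna().sum()    # the original still has its NaN

people = people.fillna("Unknown")                 # keep the result by assigning it back
print(still_missing, people["Country"].isna().sum())
```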


In [30]:
# A more drastic operation is to drop any row or column with missing values using dropna
# First, we modify the original dataframe so that 'Age' also contains a NaN (the column becomes float)
df.at[0, 'Age'] = np.nan
df

Unnamed: 0,Name,Age,Country
0,Amelie,,FR
1,Edgar,27.0,
2,Carlos,22.0,ES
3,Victor,32.0,GE


In [31]:
df.dropna()

Unnamed: 0,Name,Age,Country
2,Carlos,22.0,ES
3,Victor,32.0,GE
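
`dropna` accepts a few parameters that control what gets dropped. The sketch below uses a hypothetical `people` table and shows three common variants:

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    "Name": ["Amelie", "Edgar", "Carlos", "Victor"],
    "Age": [np.nan, 27, 22, 32],
    "Country": ["FR", np.nan, "ES", "GE"],
})

rows_complete = people.dropna()                  # drop rows containing any NaN
rows_with_age = people.dropna(subset=["Age"])    # only consider NaNs in 'Age'
cols_complete = people.dropna(axis=1)            # drop columns containing any NaN

print(rows_complete)
print(rows_with_age)
print(cols_complete)
```

Like `fillna`, each call returns a new DataFrame and leaves `people` as it was.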


## Pandas I

One of the handiest tools in **Pandas** is **groupby**. It lets you group rows by the unique values of one or more columns and compute statistics per group:
- count
- mean
- sum
- min and max
- multiple aggregations
- group using multiple columns

In [32]:
data = {
    'Name': ['Amelie', 'Edgar', 'Carlos', 'Victor', 'Sofia', 'Jin', 
             'Marta', 'Ali', 'Emily', 'Ravi', 'Chen', 'Fatima', 'Saham'],
    'Age': [24, 27, 22, 32, 29, 31, 28, 26, 23, 34, 25, 30, 26],
    'Country': ['FR', 'US', 'ES', 'GE', 'PT', 'KR',
                'ES', 'AE', 'US', 'IN', 'CN', 'AE', 'AE'],
    'Salary': [70000, 110000, 65000, 82000, 48000, 39000, 45000,
            90000, 97000, 31000, 49000, 85000, 80000]   
}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,Country,Salary
0,Amelie,24,FR,70000
1,Edgar,27,US,110000
2,Carlos,22,ES,65000
3,Victor,32,GE,82000
4,Sofia,29,PT,48000
5,Jin,31,KR,39000
6,Marta,28,ES,45000
7,Ali,26,AE,90000
8,Emily,23,US,97000
9,Ravi,34,IN,31000
10,Chen,25,CN,49000
11,Fatima,30,AE,85000
12,Saham,26,AE,80000


In [33]:
# let's count how many we have per country
df.groupby('Country').count()

Country,Name,Age,Salary
AE,3,3,3
CN,1,1,1
ES,2,2,2
FR,1,1,1
GE,1,1,1
IN,1,1,1
KR,1,1,1
PT,1,1,1
US,2,2,2


In [34]:
df.groupby('Country')['Age'].mean() # now let's know the average age per country

Country
AE    27.333333
CN    25.000000
ES    25.000000
FR    24.000000
GE    32.000000
IN    34.000000
KR    31.000000
PT    29.000000
US    25.000000
Name: Age, dtype: float64

In [35]:
df.groupby('Country')['Salary'].sum() # total sum of their salaries

Country
AE    255000
CN     49000
ES    110000
FR     70000
GE     82000
IN     31000
KR     39000
PT     48000
US    207000
Name: Salary, dtype: int64

In [36]:
# Multiple Aggregations
df.groupby('Country').agg({'Salary': ['min', 'max'], 'Age': 'mean'})

,Salary,Salary,Age
Country,min,max,mean
AE,80000,90000,27.333333
CN,49000,49000,25.0
ES,45000,65000,25.0
FR,70000,70000,24.0
GE,82000,82000,32.0
IN,31000,31000,34.0
KR,39000,39000,31.0
PT,48000,48000,29.0
US,97000,110000,25.0


In [37]:
# grouping using more than one column
df.groupby(['Country', 'Age'])['Salary'].count()

Country  Age
AE       26     2
         30     1
CN       25     1
ES       22     1
         28     1
FR       24     1
GE       32     1
IN       34     1
KR       31     1
PT       29     1
US       23     1
         27     1
Name: Salary, dtype: int64
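
When several aggregations are combined, the resulting two-level column header can be awkward to work with. One alternative is named aggregation, sketched here on a hypothetical `staff` table (the output column names `min_salary`, `max_salary`, and `mean_age` are chosen for illustration):

```python
import pandas as pd

staff = pd.DataFrame({
    "Country": ["AE", "AE", "ES", "ES", "US"],
    "Age": [26, 30, 22, 28, 27],
    "Salary": [90000, 85000, 65000, 45000, 110000],
})

# new_column=(source_column, aggregation) gives flat, readable column names
summary = (
    staff.groupby("Country")
    .agg(
        min_salary=("Salary", "min"),
        max_salary=("Salary", "max"),
        mean_age=("Age", "mean"),
    )
    .reset_index()   # turn the 'Country' index back into a regular column
)
print(summary)
```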

## Pandas I

Let's load the *Netflix* titles dataset and do some exercises.

In [38]:
# Download from Moodle the zip file containing the netflix dataset
path = '/Users/sofiamarroquin/Desktop/MSC BA/Python/Session6/netflix_titles.csv'

df = pd.read_csv(path)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [39]:
df.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

## Pandas I
Let's do the following exercises:

1. Count Missing Values in Each Column

2. Fill Missing 'country' Values with "Unknown"

3. Filter for TV Shows Only

4. Count the Number of Entries per Rating

5. Add a Column Showing Content Age (how many years since it came out)


In [40]:
df.isna().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64

In [41]:
df['country'].isna()

0       False
1       False
2        True
3        True
4       False
        ...  
8802    False
8803     True
8804    False
8805    False
8806    False
Name: country, Length: 8807, dtype: bool

In [42]:
df['country'][df['country'].isna()]

2       NaN
3       NaN
5       NaN
6       NaN
10      NaN
       ... 
8718    NaN
8759    NaN
8783    NaN
8785    NaN
8803    NaN
Name: country, Length: 831, dtype: object

In [43]:
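# Chained assignment: it still works here, but it triggers the warning printed after the output below
# Preferred form: df.loc[df['country'].isna(), 'country'] = 'Unknown'
df['country'][df['country'].isna()] = 'Unknown'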
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       8807 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df['country'][df['country'].isna()] = 'Unknown'
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['country'][d

In [44]:
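# Same effect as the cell above, but inplace=True on a selected column is also deprecated
# Preferred form: df['country'] = df['country'].fillna('Unknown')
df['country'].fillna('Unknown', inplace=True)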
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       8807 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['country'].fillna('Unknown', inplace=True) #the same as the one above
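
Both warnings above recommend performing the update in a single step. The sketch below applies the two suggested forms to a hypothetical `titles` table rather than to the Netflix file:

```python
import numpy as np
import pandas as pd

titles = pd.DataFrame({"title": ["A", "B", "C"], "country": ["Spain", np.nan, np.nan]})
titles_alt = titles.copy()

# Option 1: select rows and column in one .loc call, then assign
titles.loc[titles["country"].isna(), "country"] = "Unknown"

# Option 2: compute the filled column and assign it back
titles_alt["country"] = titles_alt["country"].fillna("Unknown")

print(titles)
print(titles_alt)
```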


In [45]:
path = '/Users/sofiamarroquin/Desktop/MSC BA/Python/Session6/netflix_titles.csv'

df = pd.read_csv(path)
df[df['type'] == 'TV Show']

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
5,s6,TV Show,Midnight Mass,Mike Flanagan,"Kate Siegel, Zach Gilford, Hamish Linklater, H...",,"September 24, 2021",2021,TV-MA,1 Season,"TV Dramas, TV Horror, TV Mysteries",The arrival of a charismatic young priest brin...
...,...,...,...,...,...,...,...,...,...,...,...,...
8795,s8796,TV Show,Yu-Gi-Oh! Arc-V,,"Mike Liscio, Emily Bauer, Billy Bob Thompson, ...","Japan, Canada","May 1, 2018",2015,TV-Y7,2 Seasons,"Anime Series, Kids' TV",Now that he's discovered the Pendulum Summonin...
8796,s8797,TV Show,Yunus Emre,,"Gökhan Atalay, Payidar Tüfekçioglu, Baran Akbu...",Turkey,"January 17, 2017",2016,TV-PG,2 Seasons,"International TV Shows, TV Dramas","During the Mongol invasions, Yunus Emre leaves..."
8797,s8798,TV Show,Zak Storm,,"Michael Johnston, Jessica Gee-George, Christin...","United States, France, South Korea, Indonesia","September 13, 2018",2016,TV-Y7,3 Seasons,Kids' TV,Teen surfer Zak Storm is mysteriously transpor...
8800,s8801,TV Show,Zindagi Gulzar Hai,,"Sanam Saeed, Fawad Khan, Ayesha Omer, Mehreen ...",Pakistan,"December 15, 2016",2012,TV-PG,1 Season,"International TV Shows, Romantic TV Shows, TV ...","Strong-willed, middle-class Kashaf and carefre..."


In [46]:
path = '/Users/sofiamarroquin/Desktop/MSC BA/Python/Session6/netflix_titles.csv'

df = pd.read_csv(path)
df.groupby('rating').count()

rating,show_id,type,title,director,cast,country,date_added,release_year,duration,listed_in,description
66 min,1,1,1,1,1,1,1,1,0,1,1
74 min,1,1,1,1,1,1,1,1,0,1,1
84 min,1,1,1,1,1,1,1,1,0,1,1
G,41,41,41,41,40,41,41,41,41,41,41
NC-17,3,3,3,2,3,3,3,3,3,3,3
NR,80,80,80,75,63,80,79,80,80,80,80
PG,287,287,287,286,279,281,287,287,287,287,287
PG-13,490,490,490,489,477,482,490,490,490,490,490
R,799,799,799,795,790,788,799,799,799,799,799
TV-14,2160,2160,2160,1457,1955,1930,2157,2160,2160,2160,2160
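
`groupby(...).count()` reports the non-null count of every column, so the numbers differ across columns wherever values are missing. If a single count per rating is what is needed, `value_counts` on the column is a more direct option. This sketch uses a hypothetical `titles` table:

```python
import pandas as pd

titles = pd.DataFrame({"rating": ["TV-MA", "PG", "TV-MA", None, "PG", "TV-MA"]})

per_rating = titles["rating"].value_counts()                 # missing ratings are skipped
per_rating_all = titles["rating"].value_counts(dropna=False) # missing ratings get their own row

print(per_rating)
print(per_rating_all)
```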


In [47]:
path = '/Users/sofiamarroquin/Desktop/MSC BA/Python/Session6/netflix_titles.csv'

df = pd.read_csv(path)
import datetime

In [48]:
t = datetime.datetime.now() - datetime.datetime.strptime('2022', '%Y')

In [49]:
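def how_many_days(year):
    # Time elapsed between January 1st of the given year and now, as a timedelta
    value = datetime.datetime.now() - datetime.datetime.strptime(str(year), '%Y')
    return value
# Note: this returns a new Series; assign it to a column of df to keep it
df['release_year'].apply(how_many_days)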

0      1778 days 08:26:29.007491
1      1412 days 08:26:29.007638
2      1412 days 08:26:29.007669
3      1412 days 08:26:29.007735
4      1412 days 08:26:29.007752
                  ...           
8802   6526 days 08:26:29.193146
8803   2508 days 08:26:29.193156
8804   5795 days 08:26:29.193166
8805   6891 days 08:26:29.193176
8806   3604 days 08:26:29.193186
Name: release_year, Length: 8807, dtype: timedelta64[ns]
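
Since the exercise asks for the content age in years stored as a new column, a simpler route is plain column arithmetic. This sketch works on a hypothetical `titles` table, and the column name `content_age` is an assumption:

```python
import datetime

import pandas as pd

titles = pd.DataFrame({"title": ["A", "B"], "release_year": [2015, 2021]})

# Subtracting a column from a scalar is applied to every row at once
current_year = datetime.date.today().year
titles["content_age"] = current_year - titles["release_year"]

print(titles)
```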

## Pandas I

Let's now load the *Titanic* dataset and practice a little bit more:

1. Count the Missing Values in Each Column

2. Fill Missing 'Age' Values with the Mean Age

3. Fill Missing 'Embarked' Values with the Mode (Most Common Value)

4. Filter and Display Passengers Who Paid a Fare Above the Average Fare

5. Add a New Column Indicating Family Size. Create a new column 'FamilySize' as the sum of 'SibSp' (siblings/spouses) and 'Parch' (parents/children)

In [50]:
path = '/Users/sofiamarroquin/Desktop/MSC BA/Python/Session6/train_and_test2.csv'

df = pd.read_csv(path)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 28 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Passengerid  1309 non-null   int64  
 1   Age          1309 non-null   float64
 2   Fare         1309 non-null   float64
 3   Sex          1309 non-null   int64  
 4   sibsp        1309 non-null   int64  
 5   zero         1309 non-null   int64  
 6   zero.1       1309 non-null   int64  
 7   zero.2       1309 non-null   int64  
 8   zero.3       1309 non-null   int64  
 9   zero.4       1309 non-null   int64  
 10  zero.5       1309 non-null   int64  
 11  zero.6       1309 non-null   int64  
 12  Parch        1309 non-null   int64  
 13  zero.7       1309 non-null   int64  
 14  zero.8       1309 non-null   int64  
 15  zero.9       1309 non-null   int64  
 16  zero.10      1309 non-null   int64  
 17  zero.11      1309 non-null   int64  
 18  zero.12      1309 non-null   int64  
 19  zero.1

In [51]:
df.isna().sum()

Passengerid    0
Age            0
Fare           0
Sex            0
sibsp          0
zero           0
zero.1         0
zero.2         0
zero.3         0
zero.4         0
zero.5         0
zero.6         0
Parch          0
zero.7         0
zero.8         0
zero.9         0
zero.10        0
zero.11        0
zero.12        0
zero.13        0
zero.14        0
Pclass         0
zero.15        0
zero.16        0
Embarked       2
zero.17        0
zero.18        0
2urvived       0
dtype: int64

In [52]:
df['Age'].isna()

0       False
1       False
2       False
3       False
4       False
        ...  
1304    False
1305    False
1306    False
1307    False
1308    False
Name: Age, Length: 1309, dtype: bool

In [53]:
path = '/Users/sofiamarroquin/Desktop/MSC BA/Python/Session6/train_and_test2.csv'

df = pd.read_csv(path)
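# Pass a dict so that only 'Age' is filled; a bare scalar would fill the NaNs of every column
df.fillna({'Age': df['Age'].mean()})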

Unnamed: 0,Passengerid,Age,Fare,Sex,sibsp,zero,zero.1,zero.2,zero.3,zero.4,...,zero.12,zero.13,zero.14,Pclass,zero.15,zero.16,Embarked,zero.17,zero.18,2urvived
0,1,22.0,7.2500,0,1,0,0,0,0,0,...,0,0,0,3,0,0,2.0,0,0,0
1,2,38.0,71.2833,1,1,0,0,0,0,0,...,0,0,0,1,0,0,0.0,0,0,1
2,3,26.0,7.9250,1,0,0,0,0,0,0,...,0,0,0,3,0,0,2.0,0,0,1
3,4,35.0,53.1000,1,1,0,0,0,0,0,...,0,0,0,1,0,0,2.0,0,0,1
4,5,35.0,8.0500,0,0,0,0,0,0,0,...,0,0,0,3,0,0,2.0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,1305,28.0,8.0500,0,0,0,0,0,0,0,...,0,0,0,3,0,0,2.0,0,0,0
1305,1306,39.0,108.9000,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0.0,0,0,0
1306,1307,38.5,7.2500,0,0,0,0,0,0,0,...,0,0,0,3,0,0,2.0,0,0,0
1307,1308,28.0,8.0500,0,0,0,0,0,0,0,...,0,0,0,3,0,0,2.0,0,0,0


In [54]:
path = '/Users/sofiamarroquin/Desktop/MSC BA/Python/Session6/train_and_test2.csv'

df = pd.read_csv(path)
# Pass a dict so that only 'Embarked' is filled with its most common value
df.fillna({'Embarked': df['Embarked'].mode()[0]})

Unnamed: 0,Passengerid,Age,Fare,Sex,sibsp,zero,zero.1,zero.2,zero.3,zero.4,...,zero.12,zero.13,zero.14,Pclass,zero.15,zero.16,Embarked,zero.17,zero.18,2urvived
0,1,22.0,7.2500,0,1,0,0,0,0,0,...,0,0,0,3,0,0,2.0,0,0,0
1,2,38.0,71.2833,1,1,0,0,0,0,0,...,0,0,0,1,0,0,0.0,0,0,1
2,3,26.0,7.9250,1,0,0,0,0,0,0,...,0,0,0,3,0,0,2.0,0,0,1
3,4,35.0,53.1000,1,1,0,0,0,0,0,...,0,0,0,1,0,0,2.0,0,0,1
4,5,35.0,8.0500,0,0,0,0,0,0,0,...,0,0,0,3,0,0,2.0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,1305,28.0,8.0500,0,0,0,0,0,0,0,...,0,0,0,3,0,0,2.0,0,0,0
1305,1306,39.0,108.9000,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0.0,0,0,0
1306,1307,38.5,7.2500,0,0,0,0,0,0,0,...,0,0,0,3,0,0,2.0,0,0,0
1307,1308,28.0,8.0500,0,0,0,0,0,0,0,...,0,0,0,3,0,0,2.0,0,0,0
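
One detail worth checking when filling values: a scalar passed to `DataFrame.fillna` is used for every column, whereas a dict fills each column with its own value. This is a sketch on a hypothetical `passengers` table:

```python
import numpy as np
import pandas as pd

passengers = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Embarked": [2.0, 2.0, np.nan],
})

# Scalar: the mean age (24.0) also ends up in the 'Embarked' column
filled_scalar = passengers.fillna(passengers["Age"].mean())

# Dict: one fill value per column
fill_values = {
    "Age": passengers["Age"].mean(),
    "Embarked": passengers["Embarked"].mode()[0],
}
filled_per_column = passengers.fillna(fill_values)

print(filled_scalar)
print(filled_per_column)
```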


In [55]:
path = '/Users/sofiamarroquin/Desktop/MSC BA/Python/Session6/train_and_test2.csv'

df = pd.read_csv(path)
df[df['Fare'] > df['Fare'].mean()]

Unnamed: 0,Passengerid,Age,Fare,Sex,sibsp,zero,zero.1,zero.2,zero.3,zero.4,...,zero.12,zero.13,zero.14,Pclass,zero.15,zero.16,Embarked,zero.17,zero.18,2urvived
1,2,38.0,71.2833,1,1,0,0,0,0,0,...,0,0,0,1,0,0,0.0,0,0,1
3,4,35.0,53.1000,1,1,0,0,0,0,0,...,0,0,0,1,0,0,2.0,0,0,1
6,7,54.0,51.8625,0,0,0,0,0,0,0,...,0,0,0,1,0,0,2.0,0,0,0
23,24,28.0,35.5000,0,0,0,0,0,0,0,...,0,0,0,1,0,0,2.0,0,0,1
27,28,19.0,263.0000,0,3,0,0,0,0,0,...,0,0,0,1,0,0,2.0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1293,1294,22.0,59.4000,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0.0,0,0,0
1294,1295,17.0,47.1000,0,0,0,0,0,0,0,...,0,0,0,1,0,0,2.0,0,0,0
1298,1299,50.0,211.5000,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0.0,0,0,0
1302,1303,37.0,90.0000,1,1,0,0,0,0,0,...,0,0,0,1,0,0,1.0,0,0,0


In [56]:
path = '/Users/sofiamarroquin/Desktop/MSC BA/Python/Session6/train_and_test2.csv'

df = pd.read_csv(path)
df['Family Size'] = df['sibsp'] + df['Parch']

df


Unnamed: 0,Passengerid,Age,Fare,Sex,sibsp,zero,zero.1,zero.2,zero.3,zero.4,...,zero.13,zero.14,Pclass,zero.15,zero.16,Embarked,zero.17,zero.18,2urvived,Family Size
0,1,22.0,7.2500,0,1,0,0,0,0,0,...,0,0,3,0,0,2.0,0,0,0,1
1,2,38.0,71.2833,1,1,0,0,0,0,0,...,0,0,1,0,0,0.0,0,0,1,1
2,3,26.0,7.9250,1,0,0,0,0,0,0,...,0,0,3,0,0,2.0,0,0,1,0
3,4,35.0,53.1000,1,1,0,0,0,0,0,...,0,0,1,0,0,2.0,0,0,1,1
4,5,35.0,8.0500,0,0,0,0,0,0,0,...,0,0,3,0,0,2.0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,1305,28.0,8.0500,0,0,0,0,0,0,0,...,0,0,3,0,0,2.0,0,0,0,0
1305,1306,39.0,108.9000,1,0,0,0,0,0,0,...,0,0,1,0,0,0.0,0,0,0,0
1306,1307,38.5,7.2500,0,0,0,0,0,0,0,...,0,0,3,0,0,2.0,0,0,0,0
1307,1308,28.0,8.0500,0,0,0,0,0,0,0,...,0,0,3,0,0,2.0,0,0,0,0


## Pandas I

Home exercises for Netflix:

1. Are there any missing ratings?
2. How many films in 2021 correspond to your country?
3. What's the number of movies in 2020 with full information?
4. Which year has the most titles?
5. What has been the average number of releases per year since 2010?

And for Titanic:

1. Calculate Gender-Based Survival Percentage

2. Calculate Survival Percentage Grouped by Gender and Class

In [57]:
#1. Is there any missing rating?
path = '/Users/sofiamarroquin/Desktop/MSC BA/Python/Session6/netflix_titles.csv'

df = pd.read_csv(path)
df.isna().sum()


show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64

There are 4 missing ratings.

In [58]:
#2. How many films in 2021 correspond to your country?
path = '/Users/sofiamarroquin/Desktop/MSC BA/Python/Session6/netflix_titles.csv'
df = pd.read_csv(path)

# Exact match on 'Spain': co-productions such as 'Spain, France' are not counted,
# and both movies and TV shows are included
len(df[(df['country'] == 'Spain') & (df['release_year'] == 2021)])

14

In [59]:
#3. What's the number of movies in 2020 with full information?
path = '/Users/sofiamarroquin/Desktop/MSC BA/Python/Session6/netflix_titles.csv'
df = pd.read_csv(path)

df.dropna(inplace=True)
len(df[(df['release_year'] == 2020) & (df['type'] == 'Movie')])

409

In [60]:
#4. Which year has the most titles?
path = '/Users/sofiamarroquin/Desktop/MSC BA/Python/Session6/netflix_titles.csv'
df = pd.read_csv(path)
number_titles = df.groupby('release_year')['title'].count()
years_sorted = number_titles.sort_values(ascending=False)
years_sorted.head(1)

release_year
2018    1147
Name: title, dtype: int64

In [66]:
#5. What has been the average number of releases per year since 2010?
path = '/Users/sofiamarroquin/Desktop/MSC BA/Python/Session6/netflix_titles.csv'
df = pd.read_csv(path)

since_2010 = df[df["release_year"] >= 2010]
titles_per_year = since_2010["release_year"].value_counts()
# mean of the per-year counts = total titles / number of distinct years
average = titles_per_year.mean()

print(int(average))


622
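
The last two Netflix exercises can also be expressed with `groupby`. This is a sketch on a hypothetical `titles` table with made-up years, not on the real file:

```python
import pandas as pd

titles = pd.DataFrame({"release_year": [2009, 2010, 2010, 2011, 2011, 2011, 2012]})

recent = titles[titles["release_year"] >= 2010]
per_year = recent.groupby("release_year").size()   # number of rows per year

average_per_year = per_year.mean()   # average number of titles per year
busiest_year = per_year.idxmax()     # index label (year) with the largest count

print(average_per_year, busiest_year)
```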


In [62]:
#And for Titanic:
#1. Calculate Gender-Based Survival Percentage
path = '/Users/sofiamarroquin/Desktop/MSC BA/Python/Session6/train_and_test2.csv'

df = pd.read_csv(path)
info = df.groupby('Sex')['2urvived'].agg(['sum', 'count'])
info['survival_percentage'] = (info['sum'] / info['count']) * 100

info

Sex,sum,count,survival_percentage
0,109,843,12.930012
1,233,466,50.0


In [63]:
#2. Calculate Survival Percentage Grouped by Gender and Class
survival = df.groupby(['Sex', 'Pclass'])['2urvived'].agg(['sum', 'count'])
survival['survival_percentage'] = (survival['sum'] / survival['count']) * 100

survival



Sex,Pclass,sum,count,survival_percentage
0,1,45,179,25.139665
0,2,17,171,9.94152
0,3,47,493,9.533469
1,1,91,144,63.194444
1,2,70,106,66.037736
1,3,72,216,33.333333
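
Because the survival column only contains 0 and 1, its mean is the fraction of survivors, so the percentage can also be obtained in one step. This sketch uses a hypothetical `passengers` table with a column named `Survived`:

```python
import pandas as pd

passengers = pd.DataFrame({
    "Sex": [0, 0, 0, 1, 1, 1],
    "Pclass": [1, 3, 3, 1, 1, 3],
    "Survived": [1, 0, 0, 1, 1, 0],
})

# mean of a 0/1 column = sum / count, i.e. the share of ones
rate_by_sex = passengers.groupby("Sex")["Survived"].mean() * 100
rate_by_sex_class = passengers.groupby(["Sex", "Pclass"])["Survived"].mean() * 100

print(rate_by_sex)
print(rate_by_sex_class)
```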
