# Creating selections and subsets of your data

### There are many ways to get selections or subsets of your data:
#### - selecting a column with `df['averageRating']`
#### - selecting multiple columns using a list: `df[['tconst', 'averageRating']]`
#### - selecting a subset using a condition: `df[df['averageRating'] > 9.0]`
#### - using `.query("averageRating > 9.0")`

### Let's first read in our data again and check the first few lines

In [1]:
import pandas as pd
pd.options.display.max_columns = 50

df = pd.read_csv('https://github.com/wortell-smart-learning/python-data-fundamentals/raw/main/data/most_voted_titles_enriched.csv')

df.head(3)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,genre1,genre2,genre3,url,averageRating,numVotes,metascore,country,primary_language,color,budget,opening_weekend_usa,gross_usa,cumulative_worldwide,tagline,summary,image_url
0,tt0010323,movie,The Cabinet of Dr. Caligari,Das Cabinet des Dr. Caligari,0,1920,,76.0,"Fantasy,Horror,Mystery",Fantasy,Horror,Mystery,https://www.imdb.com/title/tt0010323,8.1,57097,,Germany,,Black and White,"$18,000",,"$8,811","$8,811",You must become Caligari.,"Hypnotist Dr. Caligari uses a somnambulist, Ce...",https://m.media-amazon.com/images/M/MV5BNWJiNG...
1,tt0012349,movie,The Kid,The Kid,0,1921,,68.0,"Comedy,Drama,Family",Comedy,Drama,Family,https://www.imdb.com/title/tt0012349,8.3,112377,,USA,,Black and White,"$250,000",,,"$26,916",This is the great film he has been working on ...,"The Tramp cares for an abandoned child, but ev...",https://m.media-amazon.com/images/M/MV5BZjhhMT...
2,tt0013442,movie,Nosferatu,"Nosferatu, eine Symphonie des Grauens",0,1922,,94.0,"Fantasy,Horror",Fantasy,Horror,,https://www.imdb.com/title/tt0013442,7.9,88440,,Germany,,Black and White,,,,"$19,054",A thrilling mystery masterpiece - a chilling p...,Vampire Count Orlok expresses interest in a ne...,https://m.media-amazon.com/images/M/MV5BMTAxYj...


## Suppose we want only one column. Here are two ways to select it:

### 1. Specify the column name in square brackets: here we look only at the `startYear` column

In [2]:
df['startYear']

0       1920
1       1921
2       1922
3       1924
4       1925
        ... 
5825    2020
5826    2020
5827    2020
5828    2019
5829    2020
Name: startYear, Length: 5830, dtype: int64

### Selecting a single column gives you a Series

In [3]:
type(df['startYear'])

pandas.core.series.Series
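The difference between a Series and a DataFrame depends on the brackets you use. The sketch below uses a small hand-built frame named `sample` (a stand-in for illustration, with values copied from the rows shown earlier) to compare single brackets with a one-element list.

```python
import pandas as pd

# Small hand-built stand-in for the full dataset (illustration only)
sample = pd.DataFrame({
    "tconst": ["tt0010323", "tt0012349", "tt0013442"],
    "startYear": [1920, 1921, 1922],
})

as_series = sample["startYear"]    # a column name in single brackets -> Series
as_frame = sample[["startYear"]]   # a list with one column name -> DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (3, 1)
```

Use the one-element list when later code expects a table with column headers rather than a single column of values.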

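### 2. Column names are also attributes, so you can also use dot notation (this works only when the name is a valid Python identifier and does not clash with an existing DataFrame attribute or method)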

In [4]:
df.startYear

0       1920
1       1921
2       1922
3       1924
4       1925
        ... 
5825    2020
5826    2020
5827    2020
5828    2019
5829    2020
Name: startYear, Length: 5830, dtype: int64
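Dot notation is convenient, but it has limits. The sketch below uses a hand-built frame named `sample`. Its `count` and `my rating` columns are hypothetical, added only to show two cases where square brackets are the safer choice.

```python
import pandas as pd

# Hand-built frame; "count" and "my rating" are hypothetical columns for illustration
sample = pd.DataFrame({
    "startYear": [1920, 1921, 1922],
    "count": [10, 20, 30],
    "my rating": [8.1, 8.3, 7.9],
})

# For an ordinary column name, both notations return the same Series
print(sample.startYear.equals(sample["startYear"]))  # True

# "count" is also a DataFrame method, so the attribute gives the method, not the column
print(callable(sample.count))      # True
print(sample["count"].tolist())    # [10, 20, 30]

# A name containing a space cannot be written with dot notation at all
print(sample["my rating"].tolist())  # [8.1, 8.3, 7.9]
```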

### To select multiple columns, pass a list of column names

In [5]:
columns_needed = ['tconst', 'averageRating', 'startYear']

df[columns_needed]

Unnamed: 0,tconst,averageRating,startYear
0,tt0010323,8.100,1920
1,tt0012349,8.300,1921
2,tt0013442,7.900,1922
3,tt0015324,8.200,1924
4,tt0015648,8.000,1925
...,...,...,...
5825,tt9686708,7.100,2020
5826,tt9698480,7.300,2020
5827,tt9777644,6.500,2020
5828,tt9806192,7.600,2019


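### Suppose you only want titles with an average rating greater than 9.0. For that we use a boolean mask: a Series of True/False values, one per row. Passing that mask to `df[...]` keeps only the rows where it is True: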

In [6]:
df['averageRating'] > 9.0

0       False
1       False
2       False
3       False
4       False
        ...  
5825    False
5826    False
5827    False
5828    False
5829    False
Name: averageRating, Length: 5830, dtype: bool

In [7]:
df[df['averageRating'] > 9.0].head(3)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,genre1,genre2,genre3,url,averageRating,numVotes,metascore,country,primary_language,color,budget,opening_weekend_usa,gross_usa,cumulative_worldwide,tagline,summary,image_url
316,tt0068646,movie,The Godfather,The Godfather,0,1972,,175.0,"Crime,Drama",Crime,Drama,,https://www.imdb.com/title/tt0068646,9.2,1608367,100.0,USA,English,Color,"$6,000,000","$302,393,","$134,966,411","$246,120,986",An offer you can't refuse.,The aging patriarch of an organized crime dyna...,https://m.media-amazon.com/images/M/MV5BM2MyNj...
1214,tt0111161,movie,The Shawshank Redemption,The Shawshank Redemption,0,1994,,142.0,Drama,Drama,,,https://www.imdb.com/title/tt0111161,9.3,2327795,80.0,USA,English,Color,"$25,000,000","$727,327,","$28,699,976","$28,817,291",Fear can hold you prisoner. Hope can set you f...,Two imprisoned men bond over a number of years...,https://m.media-amazon.com/images/M/MV5BMDFkYT...
1712,tt0141842,tvSeries,The Sopranos,The Sopranos,0,1999,2007.0,55.0,"Crime,Drama",Crime,Drama,,https://www.imdb.com/title/tt0141842,9.2,297011,,USA,English,Color,,,,,Hell hath no fury like The Family. (season 5),New Jersey mob boss Tony Soprano deals with pe...,https://m.media-amazon.com/images/M/MV5BZGJjYz...
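A boolean mask is an ordinary Series, so you can store it in a variable, count its matches, and reuse it. The sketch below uses a hand-built frame named `sample`, with values copied from rows shown in this notebook. It also uses `.loc` to pick rows and columns in one step.

```python
import pandas as pd

# Hand-built stand-in for the full dataset (illustration only)
sample = pd.DataFrame({
    "primaryTitle": ["The Kid", "Nosferatu", "The Godfather",
                     "The Shawshank Redemption", "The Sopranos"],
    "titleType": ["movie", "movie", "movie", "movie", "tvSeries"],
    "averageRating": [8.3, 7.9, 9.2, 9.3, 9.2],
})

mask = sample["averageRating"] > 9.0

# True counts as 1, so summing the mask counts the matching rows
n_matches = int(mask.sum())
print(n_matches)  # 3

# .loc takes the row mask first and the wanted columns second
top = sample.loc[mask, ["primaryTitle", "averageRating"]]
print(top["primaryTitle"].tolist())
# ['The Godfather', 'The Shawshank Redemption', 'The Sopranos']
```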


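### Now we want to combine multiple conditions: an average rating greater than 9.0 AND only movies. Build each condition separately, wrap each one in parentheses, and combine them with `&`: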

In [8]:
(df['titleType'] == 'movie')

0        True
1        True
2        True
3        True
4        True
        ...  
5825     True
5826    False
5827     True
5828     True
5829     True
Name: titleType, Length: 5830, dtype: bool

In [9]:
(df['averageRating'] > 9.0)

0       False
1       False
2       False
3       False
4       False
        ...  
5825    False
5826    False
5827    False
5828    False
5829    False
Name: averageRating, Length: 5830, dtype: bool

In [13]:
df[(df['titleType'] == 'movie') & (df['averageRating'] > 9.0)].head(2)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,genre1,genre2,genre3,url,averageRating,numVotes,metascore,country,primary_language,color,budget,opening_weekend_usa,gross_usa,cumulative_worldwide,tagline,summary,image_url
316,tt0068646,movie,The Godfather,The Godfather,0,1972,,175.0,"Crime,Drama",Crime,Drama,,https://www.imdb.com/title/tt0068646,9.2,1608367,100.0,USA,English,Color,"$6,000,000","$302,393,","$134,966,411","$246,120,986",An offer you can't refuse.,The aging patriarch of an organized crime dyna...,https://m.media-amazon.com/images/M/MV5BM2MyNj...
1214,tt0111161,movie,The Shawshank Redemption,The Shawshank Redemption,0,1994,,142.0,Drama,Drama,,,https://www.imdb.com/title/tt0111161,9.3,2327795,80.0,USA,English,Color,"$25,000,000","$727,327,","$28,699,976","$28,817,291",Fear can hold you prisoner. Hope can set you f...,Two imprisoned men bond over a number of years...,https://m.media-amazon.com/images/M/MV5BMDFkYT...
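Conditions are combined with the element-wise operators `&` (and), `|` (or) and `~` (not). Python's own `and` does not work on a Series. The sketch below shows this on a hand-built frame named `sample`, with values copied from rows shown in this notebook.

```python
import pandas as pd

# Hand-built stand-in for the full dataset (illustration only)
sample = pd.DataFrame({
    "primaryTitle": ["The Kid", "Nosferatu", "The Godfather",
                     "The Shawshank Redemption", "The Sopranos"],
    "titleType": ["movie", "movie", "movie", "movie", "tvSeries"],
    "averageRating": [8.3, 7.9, 9.2, 9.3, 9.2],
})

is_movie = sample["titleType"] == "movie"
is_top = sample["averageRating"] > 9.0

both = sample[is_movie & is_top]     # movie AND rating above 9.0
either = sample[is_movie | is_top]   # movie OR rating above 9.0
not_movie = sample[~is_movie]        # everything that is NOT a movie

print(len(both), len(either), len(not_movie))  # 2 5 1

# The plain `and` keyword asks for a single True/False and raises on a Series
try:
    sample[is_movie and is_top]
except ValueError:
    print("use & instead of and")
```

When the comparisons are written inline, each one needs its own parentheses, because `&` binds more tightly than `==` and `>`.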


### This gets tedious quickly, so I prefer the DataFrame method `.query()`, which takes the conditions as a single string

In [12]:
df.query("titleType == 'movie' and averageRating > 9").head(2)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,genre1,genre2,genre3,url,averageRating,numVotes,metascore,country,primary_language,color,budget,opening_weekend_usa,gross_usa,cumulative_worldwide,tagline,summary,image_url
316,tt0068646,movie,The Godfather,The Godfather,0,1972,,175.0,"Crime,Drama",Crime,Drama,,https://www.imdb.com/title/tt0068646,9.2,1608367,100.0,USA,English,Color,"$6,000,000","$302,393,","$134,966,411","$246,120,986",An offer you can't refuse.,The aging patriarch of an organized crime dyna...,https://m.media-amazon.com/images/M/MV5BM2MyNj...
1214,tt0111161,movie,The Shawshank Redemption,The Shawshank Redemption,0,1994,,142.0,Drama,Drama,,,https://www.imdb.com/title/tt0111161,9.3,2327795,80.0,USA,English,Color,"$25,000,000","$727,327,","$28,699,976","$28,817,291",Fear can hold you prisoner. Hope can set you f...,Two imprisoned men bond over a number of years...,https://m.media-amazon.com/images/M/MV5BMDFkYT...
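Inside a query string you can refer to Python variables by prefixing them with `@`, which avoids pasting values into the string by hand. The sketch below uses a hand-built frame named `sample`. The variable names `wanted_type` and `min_rating` are illustrative choices only.

```python
import pandas as pd

# Hand-built stand-in for the full dataset (illustration only)
sample = pd.DataFrame({
    "primaryTitle": ["The Kid", "Nosferatu", "The Godfather",
                     "The Shawshank Redemption", "The Sopranos"],
    "titleType": ["movie", "movie", "movie", "movie", "tvSeries"],
    "averageRating": [8.3, 7.9, 9.2, 9.3, 9.2],
})

wanted_type = "movie"   # illustrative variable names
min_rating = 9.0

# @name looks up a Python variable from the surrounding code
result = sample.query("titleType == @wanted_type and averageRating > @min_rating")
print(result["primaryTitle"].tolist())
# ['The Godfather', 'The Shawshank Redemption']

# The same selection written with boolean masks, for comparison
masked = sample[(sample["titleType"] == wanted_type) & (sample["averageRating"] > min_rating)]
print(result.equals(masked))  # True
```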


### Another handy way to select rows by string values is `.isin()`, which keeps the rows whose value appears in a given list

In [14]:
df[df['genre1'].isin(['Crime', 'Drama'])].head(2)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,genre1,genre2,genre3,url,averageRating,numVotes,metascore,country,primary_language,color,budget,opening_weekend_usa,gross_usa,cumulative_worldwide,tagline,summary,image_url
4,tt0015648,movie,Battleship Potemkin,Bronenosets Potemkin,0,1925,,75.0,"Drama,History,Thriller",Drama,History,Thriller,https://www.imdb.com/title/tt0015648,8.0,52800,97.0,Soviet Union,,Black and White,,"$5,641,","$51,198","$61,389",The Sensational Russian Film which is astoundi...,In the midst of the Russian Revolution of 1905...,https://m.media-amazon.com/images/M/MV5BMTEyMT...
6,tt0017136,movie,Metropolis,Metropolis,0,1927,,153.0,"Drama,Sci-Fi",Drama,Sci-Fi,,https://www.imdb.com/title/tt0017136,8.3,159189,98.0,Germany,German,Black and White,"DEM6,000,000","$19,386,","$1,236,166","$1,349,711",The most amazing picture ever made. Pictures a...,In a futuristic city sharply divided between t...,https://m.media-amazon.com/images/M/MV5BMTg5YW...
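`.isin()` returns a boolean mask like any other condition, so it can be negated with `~` to exclude values instead. The sketch below uses a hand-built frame named `sample`, with titles and genres copied from rows shown in this notebook.

```python
import pandas as pd

# Hand-built stand-in for the full dataset (illustration only)
sample = pd.DataFrame({
    "primaryTitle": ["The Cabinet of Dr. Caligari", "The Kid", "Nosferatu",
                     "Battleship Potemkin", "Metropolis", "The Godfather"],
    "genre1": ["Fantasy", "Comedy", "Fantasy", "Drama", "Drama", "Crime"],
})

wanted = ["Crime", "Drama"]
in_list = sample["genre1"].isin(wanted)

kept = sample[in_list]        # genre1 is Crime or Drama
excluded = sample[~in_list]   # genre1 is anything else

print(kept["primaryTitle"].tolist())
# ['Battleship Potemkin', 'Metropolis', 'The Godfather']
print(excluded["genre1"].tolist())
# ['Fantasy', 'Comedy', 'Fantasy']

# isin is shorthand for chaining equality checks with |
same = (sample["genre1"] == "Crime") | (sample["genre1"] == "Drama")
print(in_list.equals(same))  # True
```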


### OK, just one more thing: to find a substring within a text column, use `.str.contains('your_text', case=False)`; `case=False` makes the match case-insensitive

In [15]:
df[df['originalTitle'].str.contains('godfather', case=False)]

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,genre1,genre2,genre3,url,averageRating,numVotes,metascore,country,primary_language,color,budget,opening_weekend_usa,gross_usa,cumulative_worldwide,tagline,summary,image_url
316,tt0068646,movie,The Godfather,The Godfather,0,1972,,175.0,"Crime,Drama",Crime,Drama,,https://www.imdb.com/title/tt0068646,9.2,1608367,100.0,USA,English,Color,"$6,000,000","$302,393,","$134,966,411","$246,120,986",An offer you can't refuse.,The aging patriarch of an organized crime dyna...,https://m.media-amazon.com/images/M/MV5BM2MyNj...
354,tt0071562,movie,The Godfather: Part II,The Godfather: Part II,0,1974,,202.0,"Crime,Drama",Crime,Drama,,https://www.imdb.com/title/tt0071562,9.0,1122882,90.0,USA,English,Color,"$13,000,000","$171,417,","$47,834,595","$48,035,783",All the power on earth can't change destiny.,The early life and career of Vito Corleone in ...,https://m.media-amazon.com/images/M/MV5BMWMwMG...
912,tt0099674,movie,The Godfather: Part III,The Godfather: Part III,0,1990,,162.0,"Crime,Drama",Crime,Drama,,https://www.imdb.com/title/tt0099674,7.6,357561,60.0,USA,English,Color,"$54,000,000","$6,387,271,","$66,761,392","$136,766,062",Real power can't be given. It must be taken.,"Follows Michael Corleone, now in his 60s, as h...",https://m.media-amazon.com/images/M/MV5BNWFlYW...
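Two optional parameters of `.str.contains()` are worth knowing. `na=False` treats missing values as non-matches, which matters because, depending on the pandas version, a missing value in the column can make the mask unusable for filtering. `regex=False` matches the text literally instead of as a regular expression. The sketch below uses a hand-built frame named `titles`, in which the missing title is a hypothetical value added for illustration.

```python
import pandas as pd

# Hand-built frame; the None entry is a hypothetical missing title
titles = pd.DataFrame({
    "originalTitle": ["The Godfather", "The Godfather: Part II", None, "Metropolis"],
})

# na=False: rows with a missing title count as "no match"
mask = titles["originalTitle"].str.contains("godfather", case=False, na=False)
print(mask.tolist())  # [True, True, False, False]
print(titles[mask]["originalTitle"].tolist())
# ['The Godfather', 'The Godfather: Part II']

# regex=False: the search text is matched literally, character for character
literal = titles["originalTitle"].str.contains(": Part", regex=False, na=False)
print(literal.tolist())  # [False, True, False, False]
```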
