<br>

**CSV** -> Comma-separated values.

**Series:** A Series is a one-dimensional labeled array capable of holding any data type (e.g., integers, floats, strings, or objects).

**DataFrame:** A DataFrame is a two-dimensional labeled data structure with columns of potentially different data types.
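
To make the two definitions concrete, here is a minimal sketch that builds one of each by hand. The names `ratings` and `orders` and all values are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional, with an index label attached to every value.
ratings = pd.Series([5.0, 4.5, 4.0], index=["a", "b", "c"], name="rating")

# A DataFrame: two-dimensional; each column can hold a different data type.
orders = pd.DataFrame(
    {"rating": [5.0, 4.5, 4.0], "city": ["Austin", "Boston", "Chicago"]}
)

print(ratings["b"])             # look up a value by its label
print(orders.dtypes)            # one dtype per column
print(type(orders["rating"]))   # selecting a single column gives a Series
```

Selecting one column of a DataFrame returns a Series, which is why the two structures are usually introduced together.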

<br>

In [35]:

# read csv file: 
import numpy as np 
import pandas as pd 

state = pd.read_csv(filepath_or_buffer="dataset/sales.csv")
state 

Unnamed: 0,rating,shipping_zip,billing_zip
0,5.0,,81220.0
1,4.5,94931.0,94931.0
2,,92625.0,92625.0
3,4.5,10003.0,10003.0
4,4.0,,92660.0
5,,,
6,,60007.0,60007.0




1. **id**: A unique identifier for each house sale record. It’s a numerical value used to distinguish individual transactions. (Example: `7129300520`)
2. **date**: The date when the house was sold, in the format `YYYYMMDDT000000` (e.g., `20141013T000000` means October 13, 2014). This can be used to analyze temporal trends in house prices.
3. **price**: The sale price of the house in USD (e.g., `221900`). This is typically the target variable for predictive modeling.
4. **bedrooms**: The number of bedrooms in the house (e.g., `3`). This indicates the house’s size and capacity.
5. **bathrooms**: The number of bathrooms, where partial bathrooms (e.g., 0.25, 0.5, 0.75) represent fixtures like a sink or toilet without a full bath or shower (e.g., `1.5` means 1 full bathroom and 1 half-bathroom).
6. **sqft_living**: The square footage of the interior living space, excluding non-living areas like garages or unfinished basements unless used as living space (e.g., `1180`).
7. **sqft_lot**: The total square footage of the lot or land on which the house is built, including the house, yard, driveway, etc. (e.g., `5650`).
8. **floors**: The number of floors in the house, where partial floors (e.g., `1.5`) indicate a split-level or additional smaller levels like a loft (e.g., `1`).
9. **waterfront**: A binary indicator (0 or 1) showing whether the house has a waterfront view (e.g., `0` means no waterfront).
10. **view**: A categorical score (0–4) indicating the quality of the view from the house (e.g., `0` means no view, `4` means an excellent view, such as of mountains or water).
11. **condition**: A score (1–5) representing the house’s condition, where `1` is poor and `5` is excellent (e.g., `3` indicates average condition).
12. **grade**: A score (1–13) assigned by the King County grading system, reflecting the quality of construction and design, with higher values indicating better quality (e.g., `7` is average, `11` is high-end).
13. **sqft_above**: The square footage of the house’s living space above ground level, excluding the basement (e.g., `1180`).
14. **sqft_basement**: The square footage of the basement, whether finished or unfinished. A value of `0` means no basement (e.g., `0`).
15. **yr_built**: The year the house was built (e.g., `1955`).
16. **yr_renovated**: The year the house was last renovated, with `0` indicating no renovation since construction (e.g., `0`).
17. **zipcode**: The postal code of the house’s location, indicating its geographic area within King County (e.g., `98178`).
18. **lat**: The latitude coordinate of the house’s location, useful for spatial analysis (e.g., `47.5112`).
19. **long**: The longitude coordinate of the house’s location (e.g., `-122.257`).
20. **sqft_living15**: The average square footage of the interior living space of the 15 nearest neighboring houses, providing context about the surrounding area (e.g., `1340`).
21. **sqft_lot15**: The average square footage of the lot size of the 15 nearest neighboring houses (e.g., `5650`).
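
The `date` column is read as plain text (dtype `object`). As a sketch, assuming every value follows the `YYYYMMDDT000000` pattern described above, it can be parsed into real timestamps with `pd.to_datetime`. The two sample strings are taken from the first rows of the dataset; the variable names are illustrative.

```python
import pandas as pd

# Two date strings in the format used by the house-sales file.
dates = pd.Series(["20141013T000000", "20141209T000000"])

# An explicit format avoids guessing and fails loudly on malformed values.
parsed = pd.to_datetime(dates, format="%Y%m%dT%H%M%S")

print(parsed.dt.year.tolist())
print(parsed.dt.month.tolist())
```

Once parsed, the `.dt` accessor exposes year, month, weekday and so on, which is what makes the temporal analysis mentioned in the `date` description possible.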

<br>
<br>

`sqft_living = sqft_above (above ground) + sqft_basement (underground)`

**`sqft_living15` and `sqft_lot15`: the average living area and the average lot size of the 15 nearest neighboring houses.**
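
The relationship between the three living-area columns can be checked directly. The sketch below uses the first five rows of the dataset typed in by hand; it only shows that the identity holds for those rows, not for the whole file.

```python
import pandas as pd

# First five rows of the three area columns, copied from the dataset.
sample = pd.DataFrame(
    {
        "sqft_living": [1180, 2570, 770, 1960, 1680],
        "sqft_above": [1180, 2170, 770, 1050, 1680],
        "sqft_basement": [0, 400, 0, 910, 0],
    }
)

# Element-wise comparison: one True/False per row.
matches = sample["sqft_living"] == sample["sqft_above"] + sample["sqft_basement"]
print(matches.all())
```

Running the same comparison on the full `houses` DataFrame would show whether any row breaks the rule.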

<br>
<br>

In [7]:

pd.set_option("display.max_columns",10)
pd.set_option("display.max_rows",10)

houses = pd.read_csv(filepath_or_buffer="dataset/kc_house_data.csv")
houses

Unnamed: 0,id,date,price,bedrooms,bathrooms,...,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.00,...,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,...,98125,47.7210,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.00,...,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.00,...,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.00,...,98074,47.6168,-122.045,1800,7503
...,...,...,...,...,...,...,...,...,...,...,...
21608,263000018,20140521T000000,360000.0,3,2.50,...,98103,47.6993,-122.346,1530,1509
21609,6600060120,20150223T000000,400000.0,4,2.50,...,98146,47.5107,-122.362,1830,7200
21610,1523300141,20140623T000000,402101.0,2,0.75,...,98144,47.5944,-122.299,1020,2007
21611,291310100,20150116T000000,400000.0,3,2.50,...,98027,47.5345,-122.069,1410,1287


In [8]:

type(houses)


pandas.core.frame.DataFrame

In [9]:

houses.columns

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [None]:
# total number of rows
len(houses)

21613

In [13]:
# (rows, columns)
houses.shape

(21613, 21)

In [14]:

# size => rows * columns
houses.size 

453873
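
The three numbers above are related: `len()` is the first element of `shape`, and `size` is the product of both elements (here 21613 * 21 = 453873). A small sketch on a made-up frame:

```python
import pandas as pd

# Tiny illustrative frame: 3 rows, 2 columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

rows, cols = df.shape
print(len(df) == rows)          # len() counts rows only
print(df.size == rows * cols)   # size counts every cell
```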

In [16]:

# alternative to pd.set_option
pd.options.display.max_columns = houses.shape[1]
houses 

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.00,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.7210,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.00,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.00,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.00,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21608,263000018,20140521T000000,360000.0,3,2.50,1530,1131,3.0,0,0,3,8,1530,0,2009,0,98103,47.6993,-122.346,1530,1509
21609,6600060120,20150223T000000,400000.0,4,2.50,2310,5813,2.0,0,0,3,8,2310,0,2014,0,98146,47.5107,-122.362,1830,7200
21610,1523300141,20140623T000000,402101.0,2,0.75,1020,1350,2.0,0,0,3,7,1020,0,2009,0,98144,47.5944,-122.299,1020,2007
21611,291310100,20150116T000000,400000.0,3,2.50,1600,2388,2.0,0,0,3,8,1600,0,2004,0,98027,47.5345,-122.069,1410,1287


In [21]:

pd.options.display.max_columns = 5 
# head and tail
print("<--- First 5 rows (head() default) ---> \n",houses.head())
print("<--- First 5 rows (head(5)) ---> \n",houses.head(5))
print("<---data type of first five values---> \n",type(houses.head(5)))
print()
print("<---Last 5 values---> \n",houses.tail())
print("<---Last 10 values---> \n ",houses.tail(10))

<--- First 5 rows (head() default) ---> 
            id             date  ...  sqft_living15  sqft_lot15
0  7129300520  20141013T000000  ...           1340        5650
1  6414100192  20141209T000000  ...           1690        7639
2  5631500400  20150225T000000  ...           2720        8062
3  2487200875  20141209T000000  ...           1360        5000
4  1954400510  20150218T000000  ...           1800        7503

[5 rows x 21 columns]
<--- First 5 rows (head(5)) ---> 
            id             date  ...  sqft_living15  sqft_lot15
0  7129300520  20141013T000000  ...           1340        5650
1  6414100192  20141209T000000  ...           1690        7639
2  5631500400  20150225T000000  ...           2720        8062
3  2487200875  20141209T000000  ...           1360        5000
4  1954400510  20150218T000000  ...           1800        7503

[5 rows x 21 columns]
<---data type of first five values---> 
 <class 'pandas.core.frame.DataFrame'>

<---Last 5 values---> 
       

In [22]:

# data type of every column:
houses.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long  

In [23]:

state.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   rating        4 non-null      float64
 1   shipping_zip  4 non-null      float64
 2   billing_zip   6 non-null      float64
dtypes: float64(3)
memory usage: 300.0 bytes


In [25]:
state.dtypes

rating          float64
shipping_zip    float64
billing_zip     float64
dtype: object

<br>
<br>

# #work with titanic: 

<br>
<br>

In [32]:

tatinic = pd.read_csv("dataset/titanic.csv")

pd.options.display.max_columns = tatinic.shape[1]
tatinic.sample(5)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
955,3,0,"Lefebre, Miss. Ida",female,?,3,1,4133,25.4667,?,S,?,?,?
119,1,1,"Frauenthal, Dr. Henry William",male,50,2,0,PC 17611,133.65,?,S,5,?,"New York, NY"
469,2,1,"Keane, Miss. Nora A",female,?,0,0,226593,12.35,E101,Q,10,?,"Harrisburg, PA"
945,3,1,"Lam, Mr. Ali",male,?,0,0,1601,56.4958,?,S,C,?,?
762,3,1,"Dean, Master. Bertram Vere",male,1,1,2,C.A. 2315,20.575,?,S,10,?,"Devon, England Wichita, KS"


In [34]:
""" 
Here, "?" marks a missing value, so we need to replace "?" with NaN.
"""
tatinic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   pclass     1309 non-null   int64 
 1   survived   1309 non-null   int64 
 2   name       1309 non-null   object
 3   sex        1309 non-null   object
 4   age        1309 non-null   object
 5   sibsp      1309 non-null   int64 
 6   parch      1309 non-null   int64 
 7   ticket     1309 non-null   object
 8   fare       1309 non-null   object
 9   cabin      1309 non-null   object
 10  embarked   1309 non-null   object
 11  boat       1309 non-null   object
 12  body       1309 non-null   object
 13  home.dest  1309 non-null   object
dtypes: int64(4), object(10)
memory usage: 143.3+ KB


In [37]:

tatinic.replace("?",np.nan,inplace=True)

In [None]:

# now, see the age, cabin, boat, body and home.dest columns
tatinic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   pclass     1309 non-null   int64 
 1   survived   1309 non-null   int64 
 2   name       1309 non-null   object
 3   sex        1309 non-null   object
 4   age        1046 non-null   object
 5   sibsp      1309 non-null   int64 
 6   parch      1309 non-null   int64 
 7   ticket     1309 non-null   object
 8   fare       1308 non-null   object
 9   cabin      295 non-null    object
 10  embarked   1307 non-null   object
 11  boat       486 non-null    object
 12  body       121 non-null    object
 13  home.dest  745 non-null    object
dtypes: int64(4), object(10)
memory usage: 143.3+ KB
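
After the replacement, `age` and `fare` are still `object` columns because their values were read as text. One possible alternative, sketched here on a hypothetical miniature of the file, is to declare `"?"` as a missing-value marker while reading (`na_values`), so pandas can infer numeric dtypes straight away. For a frame that is already loaded, `pd.to_numeric` converts a text column.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for the Titanic file; "?" marks a missing value.
raw = io.StringIO(
    "name,age,fare,cabin\n"
    "A,50,133.65,?\n"
    "B,?,12.35,E101\n"
    "C,1,?,?\n"
)

# na_values turns "?" into NaN during parsing, so age and fare become float64.
passengers = pd.read_csv(raw, na_values="?")
print(passengers.dtypes)
print(passengers.isna().sum())

# For an already-loaded text column (after replacing "?" with NaN):
ages = pd.Series(["50", np.nan, "1"], dtype=object)
ages_numeric = pd.to_numeric(ages)
print(ages_numeric.dtype)
```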


<Br>
<Br>

# #Netflix data csv file: 

- separator (`sep`) in pd.read_csv()
- index_col in pd.read_csv()

<Br>
<Br>

In [45]:

netflix = pd.read_csv(filepath_or_buffer="dataset/netflix_titles.csv",sep="|",index_col=0)
netflix

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."
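
A minimal sketch of what `sep` and `index_col` do, using an in-memory stand-in for the file. The layout (pipe-delimited, with an unnamed leading column holding the row numbers) is an assumption inferred from the call above, not something checked against the real file.

```python
import io

import pandas as pd

# Assumed layout: "|" as delimiter, first column is an unnamed row number.
raw = io.StringIO(
    "|show_id|type|title\n"
    "0|s1|Movie|Dick Johnson Is Dead\n"
    "1|s2|TV Show|Blood & Water\n"
)

# sep tells read_csv how fields are split; index_col=0 uses the first
# column as the row index instead of creating a new RangeIndex.
shows = pd.read_csv(raw, sep="|", index_col=0)
print(shows.columns.tolist())
print(shows.index.tolist())
```

Without `index_col=0`, the leading column would show up as an extra data column named `Unnamed: 0`.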


<Br>
<Br>

# #nst dataset:

- how to set column names while importing data 

<Br>
<Br>

In [59]:

# change column name while reading files: 

columns_name = ["Sum","Reg","Div","State","Name","Census", "Estimate","Pop","pop10","pop11","pop12","pop13","pop14","pop15","pop16","pop17","pop19","pop20","pop21"]

nst = pd.read_csv(filepath_or_buffer="dataset/nst-est2020.csv",names=columns_name)

pd.options.display.max_columns = nst.shape[1]
nst.head(2)

# But this keeps the original header as the first data row; we want to replace it: 

Unnamed: 0,Sum,Reg,Div,State,Name,Census,Estimate,Pop,pop10,pop11,pop12,pop13,pop14,pop15,pop16,pop17,pop19,pop20,pop21
0,SUMLEV,REGION,DIVISION,STATE,NAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015,POPESTIMATE2016,POPESTIMATE2017,POPESTIMATE2018,POPESTIMATE2019,POPESTIMATE042020,POPESTIMATE2020
1,010,0,0,00,United States,308745538,308758105,309327143,311583481,313877662,316059947,318386329,320738994,323071755,325122128,326838199,328329953,329398742,329484123


In [56]:

nst = pd.read_csv(filepath_or_buffer="dataset/nst-est2020.csv",names=columns_name,header=0)
nst.head(2)

Unnamed: 0,Sum,Reg,Div,State,Name,Census,Estimate,Pop,pop10,pop11,pop12,pop13,pop14,pop15,pop16,pop17,pop19,pop20,pop21
0,10,0,0,0,United States,308745538,308758105,309327143,311583481,313877662,316059947,318386329,320738994,323071755,325122128,326838199,328329953,329398742,329484123
1,20,1,0,0,Northeast Region,55317240,55318414,55380764,55608318,55782661,55912775,56021339,56052790,56063777,56083383,56084543,56002934,55924275,55849869
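
The difference between the two calls can be reproduced on a small in-memory sample. The four-column layout and the lowercase names below are shortened, illustrative stand-ins for the real file.

```python
import io

import pandas as pd

# Shortened stand-in for the census file: a header line plus two data rows.
raw = io.StringIO(
    "SUMLEV,REGION,NAME,CENSUS2010POP\n"
    "010,0,United States,308745538\n"
    "020,1,Northeast Region,55317240\n"
)
names = ["sumlev", "region", "name", "census"]

# names alone: pandas assumes the file has no header, so the original
# header line is kept as the first data row.
with_header_row = pd.read_csv(raw, names=names)

# names + header=0: line 0 is treated as a header and replaced by `names`.
raw.seek(0)
replaced = pd.read_csv(raw, names=names, header=0)

print(len(with_header_row), len(replaced))
print(replaced.dtypes)
```

With the stray header row gone, the numeric columns are parsed as integers rather than text, which matches the change from `010` to `10` in the output above.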


In [53]:

# change column names after reading the file: 
nst.columns = ["sumlev","region","division","state","name","census", "estimates","popestimate","pop10","pop11","pop12","pop13","pop14","pop15","pop16","pop17","pop19","pop20","pop21"]
nst.head(2) 


Unnamed: 0,sumlev,region,division,state,name,census,estimates,popestimate,pop10,pop11,pop12,pop13,pop14,pop15,pop16,pop17,pop19,pop20,pop21
0,SUMLEV,REGION,DIVISION,STATE,NAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015,POPESTIMATE2016,POPESTIMATE2017,POPESTIMATE2018,POPESTIMATE2019,POPESTIMATE042020,POPESTIMATE2020
1,010,0,0,00,United States,308745538,308758105,309327143,311583481,313877662,316059947,318386329,320738994,323071755,325122128,326838199,328329953,329398742,329484123



#  #DataFrame Basics Exercise


## Part 1
* Use pandas to read the `bestsellers` dataset into a DataFrame 
* Once you've done that, use Pandas to figure out how many rows and columns the DF has
* Inspect the first 5 rows
* Inspect the first 19 rows
* Inspect the last 5 rows
* Inspect the last 2 rows 
* Which columns (if any) are missing values?
* What datatype did Pandas assign to "User Rating"?
* How many integer columns are in the DataFrame?

In [62]:
bestsellers = pd.read_csv("dataset/bestsellers.csv")

# first 5 rows: 
bestsellers.head(5)

Unnamed: 0,Name,Author,User Rating,Reviews,Price,Year,Genre
0,10-Day Green Smoothie Cleanse,JJ Smith,4.7,17350,8,2016,Non Fiction
1,11/22/63: A Novel,Stephen King,4.6,2052,22,2011,Fiction
2,12 Rules for Life: An Antidote to Chaos,Jordan B. Peterson,4.7,18979,15,2018,Non Fiction
3,1984 (Signet Classics),George Orwell,4.7,21424,6,2017,Fiction
4,"5,000 Awesome Facts (About Everything!) (Natio...",National Geographic Kids,4.8,7665,12,2019,Non Fiction


In [63]:

# first 19 rows: 
bestsellers.head(19)

Unnamed: 0,Name,Author,User Rating,Reviews,Price,Year,Genre
0,10-Day Green Smoothie Cleanse,JJ Smith,4.7,17350,8,2016,Non Fiction
1,11/22/63: A Novel,Stephen King,4.6,2052,22,2011,Fiction
2,12 Rules for Life: An Antidote to Chaos,Jordan B. Peterson,4.7,18979,15,2018,Non Fiction
3,1984 (Signet Classics),George Orwell,4.7,21424,6,2017,Fiction
4,"5,000 Awesome Facts (About Everything!) (Natio...",National Geographic Kids,4.8,7665,12,2019,Non Fiction
...,...,...,...,...,...,...,...
14,"Act Like a Lady, Think Like a Man: What Men Re...",Steve Harvey,4.6,5013,17,2009,Non Fiction
15,Adult Coloring Book Designs: Stress Relief Col...,Adult Coloring Book Designs,4.5,2313,4,2016,Non Fiction
16,Adult Coloring Book: Stress Relieving Animal D...,Blue Star Coloring,4.6,2925,6,2015,Non Fiction
17,Adult Coloring Book: Stress Relieving Patterns,Blue Star Coloring,4.4,2951,6,2015,Non Fiction


In [64]:

# last 5 rows:
bestsellers.tail(5)

Unnamed: 0,Name,Author,User Rating,Reviews,Price,Year,Genre
545,Wrecking Ball (Diary of a Wimpy Kid Book 14),Jeff Kinney,4.9,9413,8,2019,Fiction
546,You Are a Badass: How to Stop Doubting Your Gr...,Jen Sincero,4.7,14331,8,2016,Non Fiction
547,You Are a Badass: How to Stop Doubting Your Gr...,Jen Sincero,4.7,14331,8,2017,Non Fiction
548,You Are a Badass: How to Stop Doubting Your Gr...,Jen Sincero,4.7,14331,8,2018,Non Fiction
549,You Are a Badass: How to Stop Doubting Your Gr...,Jen Sincero,4.7,14331,8,2019,Non Fiction


In [65]:

# last 2 rows:
bestsellers.tail(2)


Unnamed: 0,Name,Author,User Rating,Reviews,Price,Year,Genre
548,You Are a Badass: How to Stop Doubting Your Gr...,Jen Sincero,4.7,14331,8,2018,Non Fiction
549,You Are a Badass: How to Stop Doubting Your Gr...,Jen Sincero,4.7,14331,8,2019,Non Fiction


In [67]:

# find missing values per column (there are none):
bestsellers.isna().sum()

Name           0
Author         0
User Rating    0
Reviews        0
Price          0
Year           0
Genre          0
dtype: int64

In [74]:

# What data type was assigned to "User Rating"?
bestsellers["User Rating"].dtype

dtype('float64')

In [68]:

# check null value with info():
bestsellers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550 entries, 0 to 549
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Name         550 non-null    object 
 1   Author       550 non-null    object 
 2   User Rating  550 non-null    float64
 3   Reviews      550 non-null    int64  
 4   Price        550 non-null    int64  
 5   Year         550 non-null    int64  
 6   Genre        550 non-null    object 
dtypes: float64(1), int64(3), object(3)
memory usage: 30.2+ KB


In [71]:

# count the number of integer columns: 
len(bestsellers.select_dtypes(include=["int64"]).columns)


3

## Part 2

* The `mount_everest_deaths` dataset has its own index column provided in the dataset.  When importing it, use the existing index column.
* Which columns have zero null values?
* Which column has the most null values?


In [76]:

everest = pd.read_csv(filepath_or_buffer="dataset/mount_everest_deaths.csv")
everest.head(5)


Unnamed: 0,No.,Name,Date,Age,Expedition,Nationality,Cause of death,Location
0,1,Dorje,"June 7, 1922",,1922 British Mount Everest Expedition,Nepal,Avalanche,Below North Col
1,2,Lhakpa,"June 7, 1922",,1922 British Mount Everest Expedition,Nepal,Avalanche,Below North Col
2,3,Norbu,"June 7, 1922",,1922 British Mount Everest Expedition,Nepal,Avalanche,Below North Col
3,4,Pasang,"June 7, 1922",,1922 British Mount Everest Expedition,Nepal,Avalanche,Below North Col
4,5,Pema,"June 7, 1922",,1922 British Mount Everest Expedition,Nepal,Avalanche,Below North Col
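
The exercise asks for the dataset's own index column to be used, and the output above still shows `No.` as a regular column. A possible way to do that, sketched on a two-row stand-in for the file, is to pass `index_col`. That `No.` is the intended index is an assumption based on the column names shown.

```python
import io

import pandas as pd

# Two-row stand-in for the Everest file; Age is empty in these rows.
raw = io.StringIO(
    "No.,Name,Date,Age\n"
    '1,Dorje,"June 7, 1922",\n'
    '2,Lhakpa,"June 7, 1922",\n'
)

# index_col accepts a column name (or a position such as 0).
deaths = pd.read_csv(raw, index_col="No.")
print(deaths.index.name)
print(deaths.columns.tolist())
```

With the index set this way, `No.` no longer appears in per-column summaries such as `isnull().sum()`.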


In [88]:

everest.isnull().sum()

No.                 0
Name                0
Date                0
Age               150
Expedition         39
Nationality         1
Cause of death     14
Location           19
dtype: int64

In [90]:

# which columns have zero null values: 
everest.isnull().sum()[everest.isnull().sum()==0]

No.     0
Name    0
Date    0
dtype: int64

In [91]:

# which column has the most null values: 
everest.isnull().sum()[everest.isnull().sum()==max(everest.isnull().sum())]

Age    150
dtype: int64
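
The two filters above recompute `isnull().sum()` several times. A slightly tidier variant, shown here on the null counts printed earlier (typed in by hand as a Series), stores the counts once and uses `idxmax()` to get the label of the largest one.

```python
import pandas as pd

# Null counts per column, copied from the output of everest.isnull().sum().
null_counts = pd.Series(
    {
        "No.": 0, "Name": 0, "Date": 0, "Age": 150, "Expedition": 39,
        "Nationality": 1, "Cause of death": 14, "Location": 19,
    }
)

# Columns with no missing values.
complete = null_counts[null_counts == 0].index.tolist()

# idxmax returns the label of the (first) maximum value.
most_missing = null_counts.idxmax()

print(complete)
print(most_missing, null_counts.max())
```

On the real data the same idea would start from `null_counts = everest.isnull().sum()`.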

## Part 3
* Import the `movie_titles.tsv` dataset
* You'll notice that it is not comma-separated! You'll need to tell `read_csv` what the separator actually is.
* The dataset does not come with its own column headings, so you'll need to provide those as well.  The columns are, in order, `id`, `title`, `year`, `imdb_rating`, `imdb_id`, and `genres`
* Once you have successfully read the dataset into a DataFrame, inspect the last 7 rows!

In [99]:

# TSV: tab-separated values
columns = ["id","title","year","imdb_rating","imdb_id","genres"]

movies_titles = pd.read_csv(filepath_or_buffer="dataset/movie_titles.tsv",sep="\t",names=columns)
movies_titles

Unnamed: 0,id,title,year,imdb_rating,imdb_id,genres
0,m0,10 things i hate about you,1999,6.9,62847.0,['comedy' 'romance']
1,m1,1492: conquest of paradise,1992,6.2,10421.0,['adventure' 'biography' 'drama' 'history']
2,m2,15 minutes,2001,6.1,25854.0,['action' 'crime' 'drama' 'thriller']
3,m3,2001: a space odyssey,1968,8.4,163227.0,['adventure' 'mystery' 'sci-fi']
4,m4,48 hrs.,1982,6.9,22289.0,['action' 'comedy' 'crime' 'drama' 'thriller']
...,...,...,...,...,...,...
612,m612,watchmen,2009,7.8,135229.0,['action' 'crime' 'fantasy' 'mystery' 'sci-fi'...
613,m613,xxx,2002,5.6,53505.0,['action' 'adventure' 'crime']
614,m614,x-men,2000,7.4,122149.0,['action' 'sci-fi']
615,m615,young frankenstein,1974,8.0,57618.0,['comedy' 'sci-fi']


In [102]:

# read the dataset without the last 7 rows:
print(movies_titles.shape)
movies_titles[:-7]

(617, 6)


Unnamed: 0,id,title,year,imdb_rating,imdb_id,genres
0,m0,10 things i hate about you,1999,6.9,62847.0,['comedy' 'romance']
1,m1,1492: conquest of paradise,1992,6.2,10421.0,['adventure' 'biography' 'drama' 'history']
2,m2,15 minutes,2001,6.1,25854.0,['action' 'crime' 'drama' 'thriller']
3,m3,2001: a space odyssey,1968,8.4,163227.0,['adventure' 'mystery' 'sci-fi']
4,m4,48 hrs.,1982,6.9,22289.0,['action' 'comedy' 'crime' 'drama' 'thriller']
...,...,...,...,...,...,...
605,m605,who's your daddy?,2003/I,4.5,2267.0,['comedy']
606,m606,wild things,1998,6.6,40523.0,['crime' 'mystery' 'thriller']
607,m607,wild wild west,1999,4.3,54943.0,['action' 'western' 'comedy' 'sci-fi']
608,m608,willow,1988,7.1,33506.0,['action' 'adventure' 'drama' 'fantasy' 'roman...
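
Note that `movies_titles[:-7]` returns everything except the last 7 rows, while the exercise asks to inspect the last 7 rows, which is what `tail(7)` gives. A small sketch on a made-up ten-row frame shows the difference:

```python
import pandas as pd

# Ten illustrative rows with ids m0..m9.
titles = pd.DataFrame({"id": [f"m{i}" for i in range(10)]})

last_seven = titles.tail(7)          # the last 7 rows: m3..m9
without_last_seven = titles[:-7]     # everything before them: m0..m2

print(last_seven["id"].tolist())
print(without_last_seven["id"].tolist())
```

For the exercise itself, the equivalent call would be `movies_titles.tail(7)`.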
