# Week 5 Day 1 - Pandas

[Pandas](https://pandas.pydata.org) is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license.

In [1]:
import pandas as pd

In [2]:
# make a dictionary with lists as values
mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

In [4]:
#make it a dataframe

myCars = pd.DataFrame(mydataset)
myCars

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


In [7]:
#get the information of your dataframe

myCars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   cars      3 non-null      object
 1   passings  3 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 180.0+ bytes


In [8]:
#get the shape of your dataframe

myCars.shape

(3, 2)

In [9]:
#get the columns

myCars.columns

Index(['cars', 'passings'], dtype='object')

In [10]:
#get the row values as a NumPy array

myCars.values

array([['BMW', 3],
       ['Volvo', 7],
       ['Ford', 2]], dtype=object)

In [11]:
#get the axes (the row index and the column index)

myCars.axes

[RangeIndex(start=0, stop=3, step=1),
 Index(['cars', 'passings'], dtype='object')]

In [13]:
#get the first row
myCars.loc[0]

cars        BMW
passings      3
Name: 0, dtype: object

In [15]:
#use a list of index labels:

myCars.loc[[0,2]]

   cars  passings
0   BMW         3
2  Ford         2


In [17]:
#get a column

myCars['cars']

0      BMW
1    Volvo
2     Ford
Name: cars, dtype: object
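
As a small extension of the selections above, `.loc` also accepts a row label and a column name together, and a list of column names returns a smaller DataFrame instead of a Series. A minimal sketch, rebuilding the same data so it runs on its own (the variable names `my_cars`, `one_cell`, and `subset` are illustrative, not from the notebook):

```python
import pandas as pd

my_cars = pd.DataFrame({
    "cars": ["BMW", "Volvo", "Ford"],
    "passings": [3, 7, 2],
})

# a single cell: row label 1, column 'cars'
one_cell = my_cars.loc[1, "cars"]

# several columns at once: pass a list of names (note the double brackets)
subset = my_cars[["passings", "cars"]]

print(one_cell)
print(subset)
```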

In [18]:
#get several rows by their index labels

myCars.loc[[0,1,2]]

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


In [19]:
#what are the datatypes of the dataframe

myCars.dtypes

cars        object
passings     int64
dtype: object

In [23]:
#change 'passings' to float
#astype returns a new Series; assign the result back (or to a new variable) if you want the change to carry

myCars['passings'].astype('float')

0    3.0
1    7.0
2    2.0
Name: passings, dtype: float64

In [22]:
#look at it again: 'passings' is still int64 because the result was not assigned back

myCars.dtypes

cars        object
passings     int64
dtype: object
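
The two cells above show that `astype` leaves the original DataFrame untouched. A minimal sketch of making the conversion stick by assigning the result back to the column (the data is rebuilt here so the block runs on its own; `my_cars` is an illustrative name):

```python
import pandas as pd

my_cars = pd.DataFrame({
    "cars": ["BMW", "Volvo", "Ford"],
    "passings": [3, 7, 2],
})

# astype returns a converted copy, so store it back in the column
my_cars["passings"] = my_cars["passings"].astype("float")

print(my_cars.dtypes)
```

After the assignment, `my_cars.dtypes` reports `float64` for `passings`.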

In [27]:
#get rid of the last row
newDf = myCars.drop(2)

In [28]:
#get rid of the passings column

newDf2 = newDf.drop('passings',axis='columns')
newDf2

    cars
0    BMW
1  Volvo
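
Like `astype`, `drop` returns a new DataFrame and leaves the original alone, which is why the cells above store the result in new variables. A short sketch that checks this (names such as `without_last` are illustrative; `drop(columns=...)` is an equivalent spelling of `drop(..., axis='columns')`):

```python
import pandas as pd

my_cars = pd.DataFrame({
    "cars": ["BMW", "Volvo", "Ford"],
    "passings": [3, 7, 2],
})

# drop the row with index label 2
without_last = my_cars.drop(2)

# drop a column by name
without_passings = my_cars.drop(columns="passings")

# the original still has all 3 rows and both columns
print(my_cars.shape, without_last.shape, without_passings.shape)
```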



### NaN & empty data

In [30]:
import numpy as np

uglyData = {
  'cars': ["BMW", 'Jeep', "Ford", 'Chrysler'],
  'passings': [3, np.nan, 2, 'NaN']
}

uglyDF = pd.DataFrame(uglyData)
uglyDF

       cars passings
0       BMW      3.0
1      Jeep      NaN
2      Ford      2.0
3  Chrysler      NaN


In [32]:
#drop the rows with NaN (the string 'NaN' is not a missing value, so the Chrysler row stays)

uglyDF_clean = uglyDF.dropna()
uglyDF_clean

       cars passings
0       BMW      3.0
2      Ford      2.0
3  Chrysler      NaN


In [36]:
#replace NaN with 0 (the string 'NaN' is left untouched)

uglyDF_clean2 = uglyDF.fillna(0)
uglyDF_clean2

       cars passings
0       BMW      3.0
1      Jeep      0.0
2      Ford      2.0
3  Chrysler      NaN
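
In both results above the Chrysler row survives, because its value is the text `'NaN'` rather than a real missing value. One possible way to handle this, sketched here as a suggestion and not part of the original notebook, is to convert the column with `pd.to_numeric` first so that every non-numeric entry becomes a true NaN:

```python
import numpy as np
import pandas as pd

ugly_df = pd.DataFrame({
    "cars": ["BMW", "Jeep", "Ford", "Chrysler"],
    "passings": [3, np.nan, 2, "NaN"],
})

# only the real np.nan counts as missing; the string 'NaN' does not
missing_before = int(ugly_df["passings"].isna().sum())

# errors='coerce' turns anything that cannot be read as a number into NaN
ugly_df["passings"] = pd.to_numeric(ugly_df["passings"], errors="coerce")
missing_after = int(ugly_df["passings"].isna().sum())

# now dropna removes both the Jeep and the Chrysler rows
cleaned = ugly_df.dropna()

print(missing_before, missing_after)
print(cleaned)
```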



### .csv Files

**pd.read_csv** 

A simple way to store large data sets is to use CSV (comma-separated values) files.
CSV files contain plain text in a well-known format that almost any tool can read, including Pandas.

*pd.read_csv(filepath_or_buffer, sep=',', header='infer', index_col=None, usecols=None, engine=None, skiprows=None, nrows=None)*
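
To see what a few of these parameters do without needing a file on disk, here is a small sketch that reads CSV text from memory through `io.StringIO`. The rows, the `;` separator, and the names `csv_text` and `shows` are all made up for illustration:

```python
import io
import pandas as pd

# made-up, semicolon-separated data held in a string instead of a file
csv_text = """title;year;rating;votes
Show A;2001;8.1;1200
Show B;2010;7.4;800
Show C;2015;9.0;4300
"""

shows = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                      # the field separator used in the text
    usecols=["title", "rating"],  # keep only these columns
    index_col="title",            # use the title column as the row index
    nrows=2,                      # read only the first 2 data rows
)

print(shows)
```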

In [39]:
#import tv_shows.csv

df = pd.read_csv("data/tv_shows.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'data/tv_shows.csv'

In [20]:
#get a preview


In [21]:
#what are the columns of the csv file?


In [22]:
#return the number of non-empty cells for each column/row


In [23]:
#only import the columns ["title", "year", "rating", "votes"]


In [24]:
#what is the maximum rating? 


In [25]:
#what is the minimum rating?


In [26]:
#find the avg of all of the ratings


In [27]:
#what are the data types of the dataframe?


In [28]:
#can you change the datatype of votes?


In [29]:
#set the index to be the names of the tv show


In [30]:
#get the year column of the new dataframe
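
Since `data/tv_shows.csv` could not be found above, the following is only a sketch of how the steps in this section might look, run on a few made-up rows. The column names `title`, `year`, `rating`, and `votes` come from the prompts above; the values and variable names are assumptions:

```python
import io
import pandas as pd

# made-up rows standing in for data/tv_shows.csv
csv_text = """title,year,rating,votes
Show A,2001,8.1,1200
Show B,2010,7.4,800
Show C,2015,9.0,4300
"""
shows = pd.read_csv(io.StringIO(csv_text))

preview = shows.head(2)             # first rows as a preview
column_names = list(shows.columns)  # the columns of the file
non_empty = shows.count()           # non-empty cells per column

highest = shows["rating"].max()
lowest = shows["rating"].min()
average = shows["rating"].mean()

# change the datatype of votes, assigning back so the change carries
shows["votes"] = shows["votes"].astype("float")

# use the show titles as the index, then pull out the year column
by_title = shows.set_index("title")
years = by_title["year"]

print(highest, lowest, round(average, 2))
print(years)
```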



### .json files

**pd.read_json**

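JSON objects look like Python dictionaries

A JSON object has almost the same syntax as a Python dictionary literal (JSON requires double-quoted string keys and writes true, false, and null). If your data is already in a Python dictionary rather than in a JSON file, you can load it into a DataFrame directly.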

In [99]:
dataJson = {
    'item1':{
        "0":60,
        "1":60,
        "2":60
    },
    'item2':{
        '0':100,
        '1':100,
        '2':100
    }
}
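
The cell above only defines the dictionary. A minimal sketch of loading it into a DataFrame, where the outer keys become columns and the inner keys become the row index (the names `data_json` and `items` are illustrative):

```python
import pandas as pd

data_json = {
    "item1": {"0": 60, "1": 60, "2": 60},
    "item2": {"0": 100, "1": 100, "2": 100},
}

# a dict of dicts: outer keys -> columns, inner keys -> row labels
items = pd.DataFrame(data_json)

print(items)
```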



In [31]:
#read in nationalParks.json


In [32]:
#only get the ['date_established_readable', 'description', 'title', 'visitors', 'world_heritage_site', 'states'] columns


In [33]:
#get the datatypes

In [34]:
#find the national parks that are world heritage sites
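
The nationalParks.json file is not included here, so this is only a sketch of the reading and filtering steps on a few hypothetical records. The field names are taken from the column list above; the values, and the assumption that the file is a list of flat records, are made up and the real file may be structured differently:

```python
import io
import json
import pandas as pd

# hypothetical records; the real nationalParks.json may look different
records = [
    {"title": "Park A", "description": "desert", "visitors": 2500000, "world_heritage_site": True},
    {"title": "Park B", "description": "forest", "visitors": 400000, "world_heritage_site": False},
    {"title": "Park C", "description": "canyon", "visitors": 900000, "world_heritage_site": True},
]
json_text = json.dumps(records)

# read_json accepts a file path or a file-like object such as StringIO
parks = pd.read_json(io.StringIO(json_text))

# keep only some of the columns
parks = parks[["title", "visitors", "world_heritage_site"]]

# boolean indexing: keep the rows where the flag is True
heritage = parks[parks["world_heritage_site"]]

print(parks.dtypes)
print(heritage)
```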



#### Exercise 1: Get the national parks with over a million visitors


#### Exercise 2: Create a dictionary of the number of national parks per state
