# Inspecting Data with Pandas

### 1. Import needed package(s)
- All import statements go at the start of a notebook or script.
- In this case we are not only importing the library but also giving it the alias `pd`.
- The alias saves typing: we write `pd` instead of `pandas` every time we use the library.

In [1]:
import pandas as pd

### 2. Reading Data

Pandas has a function, `pd.read_csv()`, that loads a `.csv` or other text-based file as a DataFrame. This week's project uses `.txt` files.

In [3]:
df_1880 = pd.read_csv('../data/yob1880.txt')
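`pd.read_csv()` is not limited to comma-separated files. As a minimal sketch, the example below reads a tab-separated snippet from an in-memory buffer instead of a file on disk. The sample text and the variable names are made up for illustration; `sep` is the `read_csv` parameter that sets the delimiter.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample standing in for a text file on disk.
tsv_text = "name\tgender\tfrequency\nMary\tF\t7065\nAnna\tF\t2604\n"

# sep tells read_csv which character separates the fields (the default is a comma).
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(df_tsv)
```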

### 3. Attributes and Methods

Let's use the `.head()` *method* to take a look at the top of the DataFrame. A method is called on a DataFrame object with a dot, followed by the method name and parentheses. By default, `.head()` returns the first 5 rows.

In [4]:
df_1880.head()

|   | Mary | F | 7065 |
|---|------|---|------|
| 0 | Anna | F | 2604 |
| 1 | Emma | F | 2003 |
| 2 | Elizabeth | F | 1939 |
| 3 | Minnie | F | 1746 |
| 4 | Margaret | F | 1578 |


Something is a bit off with our dataframe. What could it be?
- Although the `.head()` method is meant to show the first 5 rows, 6 rows of data are shown here: `Mary, F, 7065` sits in the header position.
- This is because the first line of the file was expected to be the `header` of the dataframe, when in fact it is a row of data (also called an observation).
- Most datasets come with the header included. If not, a header can be passed when reading in the data.

When reading in the data, pass the `names` parameter with a list of the column names.

In [5]:
df_1880 = pd.read_csv('../data/yob1880.txt', names=['name', 'gender', 'frequency'])

Now using the `.head()` method we can see the header and the data in the first 5 rows of the dataframe.

In [6]:
df_1880.head()

|   | name | gender | frequency |
|---|------|--------|-----------|
| 0 | Mary | F | 7065 |
| 1 | Anna | F | 2604 |
| 2 | Emma | F | 2003 |
| 3 | Elizabeth | F | 1939 |
| 4 | Minnie | F | 1746 |
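The effect of `names` can be reproduced without the data file. The sketch below feeds the first three lines of the 1880 data to `pd.read_csv()` through an in-memory buffer, once with the defaults and once with `names`. The variable names are illustrative only.

```python
import io

import pandas as pd

# First three lines of the 1880 data; the file has no header row.
raw_text = "Mary,F,7065\nAnna,F,2604\nEmma,F,2003\n"

# Default behaviour: the first line is consumed as the column names.
df_default = pd.read_csv(io.StringIO(raw_text))

# With names=..., every line is treated as data.
df_named = pd.read_csv(io.StringIO(raw_text), names=["name", "gender", "frequency"])

print(list(df_default.columns), len(df_default))
print(list(df_named.columns), len(df_named))
```

With the defaults, one observation is lost to the header, so the DataFrame is one row shorter than the file.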


`.shape` is an *attribute*. It is accessed on any DataFrame with a dot and no parentheses, and it gives the number of rows and columns as a Python *tuple*:

In [7]:
df_1880.shape

(2000, 3)

Notice that methods have `()` and attributes do not. Attributes tell us something about the object. In this case the object is a dataframe of name frequencies, and the `.shape` attribute tells us its number of rows and columns. Methods do something with the object and can also be given parameters. As we saw, the `.head()` method shows 5 rows by default. We can pass a different number, such as 3, and it will show only 3 rows.

In [8]:
df_1880.head(3)

|   | name | gender | frequency |
|---|------|--------|-----------|
| 0 | Mary | F | 7065 |
| 1 | Anna | F | 2604 |
| 2 | Emma | F | 2003 |
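The difference between attributes and methods can be tried out on a small stand-in DataFrame. This is only a sketch: `df_demo` and the other variable names are made up, and the values are copied from the output above.

```python
import pandas as pd

# Small stand-in DataFrame with four of the 1880 rows.
df_demo = pd.DataFrame(
    {
        "name": ["Mary", "Anna", "Emma", "Elizabeth"],
        "gender": ["F", "F", "F", "F"],
        "frequency": [7065, 2604, 2003, 1939],
    }
)

# Attribute: no parentheses; the (rows, columns) tuple can be unpacked.
n_rows, n_cols = df_demo.shape

# Method: parentheses, optionally with an argument.
first_two = df_demo.head(2)

# Without parentheses a method is only referenced, not called.
print(callable(df_demo.head), callable(df_demo.shape))
print(n_rows, n_cols, len(first_two))
```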


### 4. Access to index and column names

In [9]:
df_1880.index

RangeIndex(start=0, stop=2000, step=1)

This tells us that the index is numeric, starting at 0 and stopping before 2000 in steps of 1, so the row labels run from 0 to 1999.
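A `RangeIndex` behaves much like Python's built-in `range`: the `stop` value itself is not included. The sketch below uses a made-up five-row DataFrame rather than the 2000-row one.

```python
import pandas as pd

# Stand-in DataFrame with the default index; five rows instead of 2000.
df_demo = pd.DataFrame({"frequency": [7065, 2604, 2003, 1939, 1746]})

idx = df_demo.index
print(idx)
print(idx.start, idx.stop, idx.step)

# stop is exclusive, so the last label is stop - 1.
last_label = idx[-1]
print(last_label, len(idx))
```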

In [10]:
df_1880.columns

Index(['name', 'gender', 'frequency'], dtype='object')

`.columns` returns the column names as a *list-like* `Index` object. We can access each element as we would in a Python list:

In [11]:
df_1880.columns[0]

'name'
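Other list-style operations work on the column index as well. The following is a small sketch on a made-up one-row DataFrame; `.tolist()` is the `Index` method that converts the column names to a plain Python list.

```python
import pandas as pd

df_demo = pd.DataFrame({"name": ["Mary"], "gender": ["F"], "frequency": [7065]})

cols = df_demo.columns
print(type(cols).__name__)

# Positional access and slicing, as with a list.
first, last = cols[0], cols[-1]
name_and_gender = list(cols[:2])

# Convert to a plain Python list, or test for membership.
column_list = cols.tolist()
has_gender = "gender" in cols

print(first, last, name_and_gender, column_list, has_gender)
```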

### 5. Information about each column

- `.info()` gives us the non-null count and the data type of each column.
- Null means that a cell contains no data, also referred to as NaN. This is considered **Missing Data**.

In [12]:
df_1880.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   name       2000 non-null   object
 1   gender     2000 non-null   object
 2   frequency  2000 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 47.0+ KB
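The 1880 data has no missing values, so every column reports 2000 non-null entries. To see what missing data looks like, here is a sketch with made-up rows in which one `frequency` value is absent. `.isna().sum()` is one common way to count missing values per column.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing frequency value.
df_missing = pd.DataFrame(
    {
        "name": ["Mary", "Anna", "Emma"],
        "gender": ["F", "F", "F"],
        "frequency": [7065, np.nan, 2003],
    }
)

# The non-null count for frequency is now lower than the number of rows.
df_missing.info()

# Count the missing values in each column.
null_counts = df_missing.isna().sum()
print(null_counts)
```

Note that the NaN also turns the `frequency` column from `int64` into `float64`.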


### 6. Pandas `.to_csv()` parameters

- Since changes have been made to the data, it is wise to save them in case we want to use the data in this format later.
- This can be done with the `.to_csv()` method.
- `.to_csv()` writes the data to a file name and path chosen by the user.

In [14]:
df_1880.to_csv('./df_1880.csv')

Check to see what the data looks like when it is read back into the notebook.

In [15]:
df_1880_new = pd.read_csv('./df_1880.csv')
df_1880_new.head()

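|   | Unnamed: 0 | name | gender | frequency |
|---|------------|------|--------|-----------|
| 0 | 0 | Mary | F | 7065 |
| 1 | 1 | Anna | F | 2604 |
| 2 | 2 | Emma | F | 2003 |
| 3 | 3 | Elizabeth | F | 1939 |
| 4 | 4 | Minnie | F | 1746 |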


What happened? Why is there now an additional 'unnamed' column?

- When writing the data, the index was also written to the `.csv` file.
- This can be avoided by passing `index=False` to `.to_csv()`.

In [16]:
df_1880.to_csv('./df_1880.csv', index=False)

In [17]:
df_1880 = pd.read_csv('./df_1880.csv')
df_1880.head()

|   | name | gender | frequency |
|---|------|--------|-----------|
| 0 | Mary | F | 7065 |
| 1 | Anna | F | 2604 |
| 2 | Emma | F | 2003 |
| 3 | Elizabeth | F | 1939 |
| 4 | Minnie | F | 1746 |
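The round trip can also be checked without touching the file system. In the sketch below, `.to_csv()` is called without a path, in which case it returns the CSV text as a string. The DataFrame and variable names are made up for illustration, and `index_col=0` is shown as an alternative fix on the reading side.

```python
import io

import pandas as pd

df_demo = pd.DataFrame(
    {"name": ["Mary", "Anna"], "gender": ["F", "F"], "frequency": [7065, 2604]}
)

# Default: the index is written as an extra first column without a name.
csv_with_index = df_demo.to_csv()
reloaded_with_index = pd.read_csv(io.StringIO(csv_with_index))

# index=False: only the data columns are written.
csv_without_index = df_demo.to_csv(index=False)
reloaded_clean = pd.read_csv(io.StringIO(csv_without_index))

# Alternative when reading: use the first column as the index again.
reloaded_index_col = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
print(list(reloaded_index_col.columns))
```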


### 7. Try again on your own with the year 2000

We currently have a dataframe for the year 1880 ready to use. Let's also read in one for the year 2000. Do you remember how to read in a dataframe with pandas?

**Note:** replace the ??? with the correct pandas commands and make sure the dataframe has a relevant header.

In [20]:
df_2000 = pd.read_csv('../data/yob2000.txt', names=['name', 'gender', 'frequency'])

To make sure the data has been read in correctly, it is always a good idea to look at the first few rows of the dataframe. How did we do that?

In [21]:
df_2000.head()

|   | name | gender | frequency |
|---|------|--------|-----------|
| 0 | Emily | F | 25957 |
| 1 | Hannah | F | 23085 |
| 2 | Madison | F | 19968 |
| 3 | Ashley | F | 17997 |
| 4 | Sarah | F | 17708 |


Perform the same steps as were performed on the 1880 dataset to save the dataset as a `.csv` file with the correct header (column names).

In [22]:
df_2000.to_csv('./df_2000.csv', index=False)

In [23]:
df_2000 = pd.read_csv('./df_2000.csv')
df_2000.head()

|   | name | gender | frequency |
|---|------|--------|-----------|
| 0 | Emily | F | 25957 |
| 1 | Hannah | F | 23085 |
| 2 | Madison | F | 19968 |
| 3 | Ashley | F | 17997 |
| 4 | Sarah | F | 17708 |


## Short review

Using the `df_2000` dataframe, find the correct command to produce each of the following results.

In [24]:
# 1. display the DataFrame
df_2000

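|   | name | gender | frequency |
|---|------|--------|-----------|
| 0 | Emily | F | 25957 |
| 1 | Hannah | F | 23085 |
| 2 | Madison | F | 19968 |
| 3 | Ashley | F | 17997 |
| 4 | Sarah | F | 17708 |
| ... | ... | ... | ... |
| 29771 | Zeph | M | 5 |
| 29772 | Zeven | M | 5 |
| 29773 | Ziggy | M | 5 |
| 29774 | Zo | M | 5 |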


In [25]:
# 2. display the first 5 rows
df_2000.head()

|   | name | gender | frequency |
|---|------|--------|-----------|
| 0 | Emily | F | 25957 |
| 1 | Hannah | F | 23085 |
| 2 | Madison | F | 19968 |
| 3 | Ashley | F | 17997 |
| 4 | Sarah | F | 17708 |


In [26]:
# 3. display the last 5 rows
df_2000.tail()

|   | name | gender | frequency |
|---|------|--------|-----------|
| 29771 | Zeph | M | 5 |
| 29772 | Zeven | M | 5 |
| 29773 | Ziggy | M | 5 |
| 29774 | Zo | M | 5 |
| 29775 | Zyier | M | 5 |


In [27]:
# 4. display the number of rows and columns
df_2000.shape

(29776, 3)

In [28]:
# 5. list the column names
df_2000.columns

Index(['name', 'gender', 'frequency'], dtype='object')

In [29]:
# 6. list the row index
df_2000.index

RangeIndex(start=0, stop=29776, step=1)

In [31]:
# 7. display the column types
df_2000.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29776 entries, 0 to 29775
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   name       29776 non-null  object
 1   gender     29776 non-null  object
 2   frequency  29776 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 698.0+ KB
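If only the column types are needed, the `.dtypes` attribute is a more compact alternative to `.info()`. This is a sketch on a made-up one-row DataFrame, not part of the original exercise.

```python
import pandas as pd

df_demo = pd.DataFrame({"name": ["Emily"], "gender": ["F"], "frequency": [25957]})

# .dtypes is an attribute: a Series mapping each column name to its data type.
column_types = df_demo.dtypes
print(column_types)
```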


## License
(c) 2022 Samuel McGuire.
Distributed under the conditions of the MIT License.