# Viewing and inspecting data with pandas

## Libraries

In [2]:
import pandas as pd
from openpyxl.workbook import Workbook

### Manipulating the data frame is key to getting what you want out of the data
* Some basic selection
* Viewing functions and attributes: ````read_csv````, ````columns````, ````iloc````, ````to_excel````
* Saving the desired values to an Excel sheet
* Working with the CSV file named Names.csv

## Problem Statement

* Imagine you had to deal with a spreadsheet with so many columns that it was hard to read the data in full in your terminal.

* You need to know what each column contains so that you can access the data you need. To do this, use the ````columns```` attribute, the same one used in section 1 to assign the column names.
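As a rough illustration of inspecting a wide table, the sketch below lists the column names and temporarily lifts the display limit. The frame ````wide```` and its ````col_0````...````col_29```` names are made up for the example and are not part of Names.csv.

```python
import pandas as pd

# Hypothetical wide frame: 30 columns named col_0 ... col_29, one row of zeros
wide = pd.DataFrame([[0] * 30], columns=[f'col_{i}' for i in range(30)])

# The columns attribute holds the labels; tolist() turns them into a plain list
column_names = wide.columns.tolist()
print(len(column_names), column_names[:3])

# Temporarily lift the column display limit so that every column is printed
with pd.option_context('display.max_columns', None):
    print(wide.head())
```

Using ````option_context```` keeps the display change local to the ````with```` block, so the rest of the session keeps its default settings.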


In [3]:
df_csv = pd.read_csv('Exercise_Files/Names.csv', header=None)
df_csv

Unnamed: 0,0,1,2,3,4,5,6
0,John,Doe,120 jefferson st.,Riverside,NJ,8074,45000
1,Jack,McGinnis,220 hobo Av.,Phila,PA,9119,18000
2,"John ""Da Man""",Repici,120 Jefferson St.,Riverside,NJ,8075,120000
3,Stephen,Tyler,"7452 Terrace ""At the Plaza"" road",SomeTown,SD,91234,90000
4,,Blankman,,SomeTown,SD,298,30000
5,"Joan ""Danger"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,123,68000


### 1. How to add headers to a data frame

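* As you can see, the data frame has no column names defined; because the file was read with ````header=None````, the columns are simply numbered 0 to 6
* By assigning a list to the ````columns```` attribute, you can add column names to the data frame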

In [4]:
df_csv.columns = ['First','Last', 'Address','City','State','Area','Code']
df_csv

Unnamed: 0,First,Last,Address,City,State,Area,Code
0,John,Doe,120 jefferson st.,Riverside,NJ,8074,45000
1,Jack,McGinnis,220 hobo Av.,Phila,PA,9119,18000
2,"John ""Da Man""",Repici,120 Jefferson St.,Riverside,NJ,8075,120000
3,Stephen,Tyler,"7452 Terrace ""At the Plaza"" road",SomeTown,SD,91234,90000
4,,Blankman,,SomeTown,SD,298,30000
5,"Joan ""Danger"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,123,68000


* The list must contain exactly one name per column; otherwise pandas raises an error such as ````ValueError: Length mismatch: Expected axis has 7 elements, new values have 6 elements````
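The following sketch shows one way to reproduce the length-mismatch error, and an alternative that names the columns while reading. It assumes a small in-memory stand-in for Names.csv (the ````raw```` buffer is hypothetical) so that it runs without the exercise files.

```python
import io

import pandas as pd

# Hypothetical stand-in for Names.csv: two rows, seven fields, no header row
raw = io.StringIO(
    "John,Doe,120 jefferson st.,Riverside,NJ,8074,45000\n"
    "Jack,McGinnis,220 hobo Av.,Phila,PA,9119,18000\n"
)
names = ['First', 'Last', 'Address', 'City', 'State', 'Area', 'Code']

# Alternative to assigning df.columns afterwards: pass the names while reading
df = pd.read_csv(raw, header=None, names=names)

# Assigning a list of the wrong length raises ValueError
try:
    df.columns = names[:6]  # only 6 names for 7 columns
    mismatch_error = None
except ValueError as err:
    mismatch_error = str(err)

print(list(df.columns))
print(mismatch_error)
```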

### 2. How to view the values of a column

* Now, let's say you want to view just one column
* You can index by column name; in this case you want to view the data of the State column

In [5]:
# example using single []
df_csv['State']

0     NJ
1     PA
2     NJ
3     SD
4     SD
5     CO
Name: State, dtype: object

In [6]:
#Example using double []
df_csv[['State']]

Unnamed: 0,State
0,NJ
1,PA
2,NJ
3,SD
4,SD
5,CO


* Both forms show the values together with their index; single brackets return a Series, while double brackets return a DataFrame
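To make the difference between the two bracket styles concrete, here is a minimal sketch on a small made-up frame (the data is illustrative, not read from Names.csv):

```python
import pandas as pd

# Small illustrative frame
df = pd.DataFrame({'State': ['NJ', 'PA', 'SD'], 'Code': [45000, 18000, 90000]})

single = df['State']     # one label: a one-dimensional Series
double = df[['State']]   # list of labels: a two-dimensional DataFrame

print(type(single).__name__, single.shape)
print(type(double).__name__, double.shape)
```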

### 3. How to view data from multiple columns

* If you want to access the data of multiple columns, pass the column names in as a list
* Double brackets ````[[]]```` are used, as in ````[['State','Code']]````: the outer pair indexes the data frame and the inner pair is the list of columns

In [7]:
df_csv[['State','Code']]

Unnamed: 0,State,Code
0,NJ,45000
1,PA,18000
2,NJ,120000
3,SD,90000
4,SD,30000
5,CO,68000
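Two details of list-based selection can be sketched as follows, assuming a small made-up frame: the result follows the order of the list, and a name that is not a column raises ````KeyError````. The column ````Zip```` is hypothetical and deliberately absent.

```python
import pandas as pd

# Small illustrative frame
df = pd.DataFrame({
    'First': ['John', 'Jack'],
    'State': ['NJ', 'PA'],
    'Code': [45000, 18000],
})

# The result follows the order of the list, not the order in the frame
subset = df[['Code', 'State']]
print(list(subset.columns))

# A name that is not a column raises KeyError
try:
    df[['State', 'Zip']]  # 'Zip' is a hypothetical, non-existent column
    missing_column_error = False
except KeyError:
    missing_column_error = True
print(missing_column_error)
```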


### 4. How to use slicing to view the values of the first 3 rows of a single column

* If you have a large data set and want to view only certain values, you can do so by slicing
* With slicing, you choose the column and the row range, like a coordinate system, to view the values
* In this case we want to view the first 3 rows of the column ````First````, which are indexed from 0 to 2

In [8]:
print(df_csv['First'][0:3])

0             John
1             Jack
2    John "Da Man"
Name: First, dtype: object
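The same three rows can also be requested with an explicit indexer, which states whether the slice is by position or by label. This is a sketch on made-up data; note that ````iloc```` excludes the end of the slice while ````loc```` includes it.

```python
import pandas as pd

# Small illustrative frame with the default 0..3 index
df = pd.DataFrame({
    'First': ['John', 'Jack', 'Joan', 'Stephen'],
    'Last': ['Doe', 'McGinnis', 'Jet', 'Tyler'],
})

by_position = df['First'].iloc[0:3]  # positions 0, 1, 2 (end excluded)
by_label = df.loc[0:2, 'First']      # labels 0 through 2 (end included)

print(by_position.tolist())
print(by_label.tolist())
```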


### 5. How to view values from a single row
* Use the integer-location indexer ````iloc````
* It lets you choose the row to access by its position
* In this case you access the values of row number 4 (the fifth row, since counting starts at 0)
* The correct syntax is ````dataframe.iloc[4]````: the data frame, then ````iloc````, then the row position inside the ````[]````

In [11]:
# single []
df_csv.iloc[4]

First           NaN
Last       Blankman
Address         NaN
City       SomeTown
State            SD
Area            298
Code          30000
Name: 4, dtype: object

In [12]:
# double []
df_csv.iloc[[4]]

Unnamed: 0,First,Last,Address,City,State,Area,Code
4,,Blankman,,SomeTown,SD,298,30000


In [18]:
#df_csv.iloc[df_csv.index]
#df_csv.index

In [14]:
print(df_csv.iloc[4])

First           NaN
Last       Blankman
Address         NaN
City       SomeTown
State            SD
Area            298
Code          30000
Name: 4, dtype: object
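Position and label only coincide while the index is the default 0, 1, 2, ... range. The sketch below, on made-up data, filters out a row so that the two diverge; ````iloc```` keeps counting positions while ````loc```` looks up the original labels.

```python
import pandas as pd

# Small illustrative frame
df = pd.DataFrame({'Last': ['Doe', 'McGinnis', 'Repici', 'Tyler', 'Blankman']})

# Dropping the first row leaves the index labels 1..4
filtered = df[df['Last'] != 'Doe']

first_by_position = filtered.iloc[0]['Last']  # position 0 is now label 1
last_by_label = filtered.loc[4, 'Last']       # label 4 still exists
one_row_frame = filtered.iloc[[3]]            # a list keeps the DataFrame shape

print(first_by_position, last_by_label, one_row_frame.shape)
```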


### 6. How to view values from a single cell
* Select the cell by its row number and column number
* Remember that counting starts at 0
* Here you access the value at row 3, column 1
* In this data frame that value is "Tyler"

In [16]:
df_csv

Unnamed: 0,First,Last,Address,City,State,Area,Code
0,John,Doe,120 jefferson st.,Riverside,NJ,8074,45000
1,Jack,McGinnis,220 hobo Av.,Phila,PA,9119,18000
2,"John ""Da Man""",Repici,120 Jefferson St.,Riverside,NJ,8075,120000
3,Stephen,Tyler,"7452 Terrace ""At the Plaza"" road",SomeTown,SD,91234,90000
4,,Blankman,,SomeTown,SD,298,30000
5,"Joan ""Danger"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,123,68000


* The value "Tyler" is located at row 3, column 1 (zero-based), that is, the fourth row and the second column
* Use ````iloc```` with both the row position and the column position

In [32]:
#df_csv.iloc[df_csv.index].isnull() #to view what cell has null values

In [33]:
print(df_csv.iloc[3,1])

Tyler
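For a single cell there are also the scalar accessors ````iat```` (by position) and ````at```` (by label). A small sketch on made-up data, where all four lookups should return the same value:

```python
import pandas as pd

# Small illustrative frame with the default 0..3 index
df = pd.DataFrame({
    'First': ['John', 'Jack', 'Joan', 'Stephen'],
    'Last': ['Doe', 'McGinnis', 'Jet', 'Tyler'],
})

via_iloc = df.iloc[3, 1]       # row position 3, column position 1
via_iat = df.iat[3, 1]         # same positions, scalar-only accessor
via_loc = df.loc[3, 'Last']    # row label 3, column label 'Last'
via_at = df.at[3, 'Last']      # same labels, scalar-only accessor

print(via_iloc, via_iat, via_loc, via_at)
```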


### 7. How to select values, store and save them as a new data frame in an Excel file

* Imagine you want to select a set of values from the data frame, for example the values of the columns *First*, *Last* and *State*
* You can store those values in a new Excel file

In [18]:
# 1. Select the wanted values
df_csv[['First','Last','State']]

Unnamed: 0,First,Last,State
0,John,Doe,NJ
1,Jack,McGinnis,PA
2,"John ""Da Man""",Repici,NJ
3,Stephen,Tyler,SD
4,,Blankman,SD
5,"Joan ""Danger"", Anne",Jet,CO


In [11]:
# 2. Store the wanted values in a variable
wanted_values = df_csv[['First','Last','State']]

In [None]:
# 3. Save the wanted values in an Excel file
# to_excel writes the file and returns None, so there is nothing to assign
wanted_values.to_excel('Exercise_Files/State_location.xlsx', index=False)
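One way to check the saved file is to read it back. The sketch below assumes made-up data and a temporary output path, and it skips the round trip when the ````openpyxl```` package (needed by pandas for .xlsx files) is not installed.

```python
import importlib.util
import os
import tempfile

import pandas as pd

# Small illustrative frame
df = pd.DataFrame({
    'First': ['John', 'Jack'],
    'Last': ['Doe', 'McGinnis'],
    'State': ['NJ', 'PA'],
    'Code': [45000, 18000],
})
wanted = df[['First', 'Last', 'State']]

if importlib.util.find_spec('openpyxl') is not None:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'State_location.xlsx')  # hypothetical output path
        wanted.to_excel(path, index=False)  # index=False leaves out the row labels
        round_trip = pd.read_excel(path)
else:
    # Fallback so the example still runs without openpyxl; nothing is written
    round_trip = wanted.copy()

print(list(round_trip.columns))
```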

## Reference
[Viewing and inspecting data](https://www.linkedin.com/learning/using-python-with-excel/viewing-and-inspecting-data-with-pandas?autoplay=true&resume=false&u=2134922)

In [96]:
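# True if any cell in the data frame is missing
value_null = df_csv.isnull().values.any()
# Rows that contain at least one missing value
rows_value_null = df_csv[df_csv.isnull().any(axis=1)]
def null_values():
    # Returns the rows with missing values, or None when there are none
    if value_null:
        return rows_value_null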

In [97]:
null_values()

Unnamed: 0,First,Last,Address,City,State,Area,Code
4,,Blankman,,SomeTown,SD,298,30000
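A per-column count of missing values can complement the row filter above. This is a sketch on made-up data that mimics the row with a missing first name and address:

```python
import pandas as pd

# Small illustrative frame; None marks the missing cells
df = pd.DataFrame({
    'First': ['John', None, 'Joan'],
    'Last': ['Doe', 'Blankman', 'Jet'],
    'Address': ['120 jefferson st.', None, '9th, at Terrace plc'],
    'Code': [45000, 30000, 68000],
})

# Number of missing cells in each column
missing_per_column = df.isna().sum()

# Rows that contain at least one missing cell
rows_with_missing = df[df.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```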


In [98]:
df_csv.dtypes[(df_csv.dtypes == "int64") | (df_csv.dtypes == "float64")]

Area    int64
Code    int64
dtype: object

In [99]:
columns_int_float = df_csv.dtypes[(df_csv.dtypes == "int64") | (df_csv.dtypes == "float64")]

In [112]:
print(columns_int_float)

Area    int64
Code    int64
dtype: object
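As an alternative to comparing ````dtypes```` against ````"int64"```` and ````"float64"```` by hand, pandas offers ````select_dtypes````. A minimal sketch on made-up data:

```python
import pandas as pd

# Small illustrative frame with text and numeric columns
df = pd.DataFrame({
    'First': ['John', 'Jack'],
    'State': ['NJ', 'PA'],
    'Area': [8074, 9119],
    'Code': [45000.0, 18000.0],
})

# include='number' matches integer and float columns alike
numeric_columns = df.select_dtypes(include='number').columns.tolist()
print(numeric_columns)
```

This also picks up numeric columns of other widths, such as ````int32````, which the explicit comparison would miss.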


In [121]:
for name, dtype in columns_int_float.items():
    print(name)

Area
Code


In [132]:
#df_csv.columns_int_float.fillna(0) # AttributeError
columns_int_float = df_csv.dtypes[(df_csv.dtypes == "int64") | (df_csv.dtypes == "float64")]
def column_numerical():
    # Print the name of each numeric column
    # print() returns None, so there is no value worth returning
    for name, dtype in columns_int_float.items():
        print(name)

In [133]:
column_numerical()

Area
Code


In [134]:
# Fill missing values in the numeric columns only
# df_csv.{i} raised a SyntaxError, and df_csv.columns_num raised an AttributeError,
# because attribute access needs a literal column name; select by name with [] instead
columns_int_float = df_csv.dtypes[(df_csv.dtypes == "int64") | (df_csv.dtypes == "float64")]
def column_numerical():
    numeric_names = list(columns_int_float.index)
    # fillna returns a new data frame; df_csv itself is not modified
    return df_csv[numeric_names].fillna(0)

In [131]:
column_numerical()

In [148]:
#df_csv.columns_int_float.fillna(0) # AttributeError
columns_int_float = df_csv.dtypes[(df_csv.dtypes == "int64") | (df_csv.dtypes == "float64")]
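def column_numerical2():
    # Note: return inside the loop exits at the first numeric column,
    # and the f-string only builds text; it does not run fillna
    for name, dtype in columns_int_float.items():
        return f'df_csv.{name}.fillna(0)'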

In [149]:
column_numerical2()

'df_csv.Area.fillna(0)'

In [161]:
#df_csv.columns_int_float.fillna(0) # AttributeError
columns_int_float = df_csv.dtypes[(df_csv.dtypes == "int64") | (df_csv.dtypes == "float64")]
def column_numerical3():
    # Print the name of each numeric column; the function returns None
    # (the nested helper was never called and referred to an undefined name, so it is dropped)
    for name, dtype in columns_int_float.items():
        print(name)

In [162]:
column_numerical3()

Area
Code


In [None]:
print(columns_int_float)

In [None]:
# How to change NaN values to 0 (fillna returns a new object; df_csv itself is not modified)
df_csv.fillna(0)
df_csv.First.fillna(0)
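Because ````fillna```` returns a new object, the result has to be assigned to be kept. The sketch below, on made-up data, also passes a dict so that each column gets its own replacement; the placeholder ````'Unknown'```` is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with one missing name and one missing number
df = pd.DataFrame({
    'First': ['John', None],
    'Area': [8074.0, np.nan],
    'Code': [45000, 30000],
})

# A dict maps each column to its own fill value; the result is a new frame
filled = df.fillna({'First': 'Unknown', 'Area': 0})

print(filled)
print(df['Area'].isna().sum())  # the original frame still has its missing value
```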