# Structured Data
***

## Learning Objectives
- Understand the difference between structured and unstructured data
- Recognize the limitations of structured data
- Identify common types of structured data
  
## Introduction
Structured data is data organized into rows and columns so that it can be accessed and modified efficiently. It is typically stored in a relational database, such as MySQL or PostgreSQL, or in a spreadsheet, and is often called **tabular data**.
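
To make "rows and columns" concrete, here is a minimal sketch that uses only Python's standard `csv` module. The three-row table is made up for illustration and does not come from any dataset used in this lesson.

```python
import csv
import io

# A tiny, made-up table: the first line names the columns,
# and every later line is one row.
raw = """order_id,city,quantity
1,Paris,30
2,Reims,34
3,Pasadena,45
"""

# DictReader maps each row to the column names from the header line.
rows = list(csv.DictReader(io.StringIO(raw)))

# Every row has the same fields, which is what makes the data "structured".
columns = list(rows[0].keys())
print(columns)

# Values are read as strings, so convert them before doing arithmetic.
total_quantity = sum(int(row["quantity"]) for row in rows)
print(total_quantity)
```

Because each row shares the same columns, a question such as "what is the total quantity?" can be answered with a single pass over the rows.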

Let's get started by reading the contents of a CSV file. To find out how a function such as `pd.read_csv` works, append a question mark to its name in a notebook cell (`pd.read_csv?`) to display its documentation.


### CSV as structured data

In [10]:
import pandas as pd

df = pd.read_csv('sales_data_sample.csv', encoding='latin1')
# Get the first 5 rows
df.head()

Unnamed: 0,ORDERNUMBER,QUANTITYORDERED,PRICEEACH,ORDERLINENUMBER,SALES,ORDERDATE,STATUS,QTR_ID,MONTH_ID,YEAR_ID,...,ADDRESSLINE1,ADDRESSLINE2,CITY,STATE,POSTALCODE,COUNTRY,TERRITORY,CONTACTLASTNAME,CONTACTFIRSTNAME,DEALSIZE
0,10107,30,95.7,2,2871.0,2/24/2003 0:00,Shipped,1,2,2003,...,897 Long Airport Avenue,,NYC,NY,10022.0,USA,,Yu,Kwai,Small
1,10121,34,81.35,5,2765.9,5/7/2003 0:00,Shipped,2,5,2003,...,59 rue de l'Abbaye,,Reims,,51100.0,France,EMEA,Henriot,Paul,Small
2,10134,41,94.74,2,3884.34,7/1/2003 0:00,Shipped,3,7,2003,...,27 rue du Colonel Pierre Avia,,Paris,,75508.0,France,EMEA,Da Cunha,Daniel,Medium
3,10145,45,83.26,6,3746.7,8/25/2003 0:00,Shipped,3,8,2003,...,78934 Hillside Dr.,,Pasadena,CA,90003.0,USA,,Young,Julie,Medium
4,10159,49,100.0,14,5205.27,10/10/2003 0:00,Shipped,4,10,2003,...,7734 Strong St.,,San Francisco,CA,,USA,,Brown,Julie,Medium
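
The `ORDERDATE` values above are text such as `2/24/2003 0:00`. The sketch below shows one possible way to turn such a column into real dates with `pd.to_datetime`. The two rows are copied from the output above into an in-memory CSV, and the format string is an assumption inferred from those rows, not something the dataset documents.

```python
import io

import pandas as pd

# Two rows that mimic the ORDERDATE format shown above.
raw = """ORDERNUMBER,ORDERDATE,SALES
10107,2/24/2003 0:00,2871.0
10121,5/7/2003 0:00,2765.9
"""

sample = pd.read_csv(io.StringIO(raw))

# Without extra arguments the dates are read as plain strings.
# Convert them explicitly; the format string is an assumption
# based on the rows above (month/day/year hour:minute).
sample["ORDERDATE"] = pd.to_datetime(sample["ORDERDATE"], format="%m/%d/%Y %H:%M")

print(sample.dtypes)

# Once the column holds dates, parts such as the month are easy to extract.
months = sample["ORDERDATE"].dt.month.tolist()
print(months)
```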


In [11]:
# Summary statistics for the numeric columns
df.describe()

Unnamed: 0,ORDERNUMBER,QUANTITYORDERED,PRICEEACH,ORDERLINENUMBER,SALES,QTR_ID,MONTH_ID,YEAR_ID,MSRP
count,2823.0,2823.0,2823.0,2823.0,2823.0,2823.0,2823.0,2823.0,2823.0
mean,10258.725115,35.092809,83.658544,6.466171,3553.889072,2.717676,7.092455,2003.81509,100.715551
std,92.085478,9.741443,20.174277,4.225841,1841.865106,1.203878,3.656633,0.69967,40.187912
min,10100.0,6.0,26.88,1.0,482.13,1.0,1.0,2003.0,33.0
25%,10180.0,27.0,68.86,3.0,2203.43,2.0,4.0,2003.0,68.0
50%,10262.0,35.0,95.7,6.0,3184.8,3.0,8.0,2004.0,99.0
75%,10333.5,43.0,100.0,9.0,4508.0,4.0,11.0,2004.0,124.0
max,10425.0,97.0,100.0,18.0,14082.8,4.0,12.0,2005.0,214.0


In [8]:
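# Get the columns of the df
# (the output below matches the bank loan data from the Excel section,
# not the sales CSV; re-run after loading the CSV to see the sales columns)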
df.columns

Index(['ID', 'Age', 'Experience', 'Income', 'ZIP Code', 'Family', 'CCAvg',
       'Education', 'Mortgage', 'Personal Loan', 'Securities Account',
       'CD Account', 'Online', 'CreditCard'],
      dtype='object')

In [12]:
# Remove spaces from the column names
df.columns = df.columns.str.replace(' ', '')
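
A small sketch of what this renaming does, using a one-row frame with made-up values and three column names that appear in this lesson. The underscore variant is an alternative style added here for comparison; it is not part of the original lesson.

```python
import pandas as pd

# A one-row frame with made-up values.
sample = pd.DataFrame(
    [[91107, 0, 1]],
    columns=["ZIP Code", "Personal Loan", "CD Account"],
)

# Same approach as the cell above: delete every space.
no_spaces = list(sample.columns.str.replace(" ", ""))
print(no_spaces)

# Alternative (an assumption about style): trim the ends and
# replace inner spaces with underscores to keep names readable.
with_underscores = list(sample.columns.str.strip().str.replace(" ", "_"))
print(with_underscores)
```

Names without spaces can be used with attribute access (for example `df.ZIPCode`), which is one reason to clean them early.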

In [9]:
# Number of rows and columns
# (the output below matches the bank loan data, not the sales CSV)
df.shape

(5000, 14)

### Excel as structured data

In [13]:
import openpyxl

workbook = openpyxl.load_workbook("Bank_Personal_Loan_Modelling.xlsx")

sheet_names = workbook.sheetnames

for sheet_name in sheet_names:
    print(sheet_name)


Description
Data
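
When only the sheet names are needed, openpyxl's `read_only=True` mode may be worth considering, since it loads the sheets lazily. The sketch below builds a stand-in workbook in memory so it runs without the lesson's Excel file; the sheet names mirror the output above and the sheets themselves are empty.

```python
import io

import openpyxl

# Build a small stand-in workbook in memory so no file on disk is needed.
# The sheet names mirror the output above; the sheets are empty.
wb = openpyxl.Workbook()
wb.active.title = "Description"
wb.create_sheet("Data")
buffer = io.BytesIO()
wb.save(buffer)
buffer.seek(0)

# read_only=True reads sheets lazily instead of loading every cell up front.
workbook = openpyxl.load_workbook(buffer, read_only=True)
sheet_names = workbook.sheetnames

# Read-only workbooks keep the file handle open until closed explicitly.
workbook.close()

print(sheet_names)
```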


In [16]:
# Load the values of the Data sheet into a dataframe
df = pd.read_excel("Bank_Personal_Loan_Modelling.xlsx", sheet_name="Data")

# Get the first 5 rows
df.head()

Unnamed: 0,ID,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
0,1,25,1,49,91107,4,1.6,1,0,0,1,0,0,0
1,2,45,19,34,90089,3,1.5,1,0,0,1,0,0,0
2,3,39,15,11,94720,1,1.0,1,0,0,0,0,0,0
3,4,35,9,100,94112,1,2.7,2,0,0,0,0,0,0
4,5,35,8,45,91330,4,1.0,2,0,0,0,0,0,1


In [17]:
# Get the columns of the df
df.columns

Index(['ID', 'Age', 'Experience', 'Income', 'ZIP Code', 'Family', 'CCAvg',
       'Education', 'Mortgage', 'Personal Loan', 'Securities Account',
       'CD Account', 'Online', 'CreditCard'],
      dtype='object')

In [18]:
# Describe data
df.describe()

Unnamed: 0,ID,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,2500.5,45.3384,20.1046,73.7742,93152.503,2.3964,1.937913,1.881,56.4988,0.096,0.1044,0.0604,0.5968,0.294
std,1443.520003,11.463166,11.467954,46.033729,2121.852197,1.147663,1.747666,0.839869,101.713802,0.294621,0.305809,0.23825,0.490589,0.455637
min,1.0,23.0,-3.0,8.0,9307.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1250.75,35.0,10.0,39.0,91911.0,1.0,0.7,1.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2500.5,45.0,20.0,64.0,93437.0,2.0,1.5,2.0,0.0,0.0,0.0,0.0,1.0,0.0
75%,3750.25,55.0,30.0,98.0,94608.0,3.0,2.5,3.0,101.0,0.0,0.0,0.0,1.0,1.0
max,5000.0,67.0,43.0,224.0,96651.0,4.0,10.0,3.0,635.0,1.0,1.0,1.0,1.0,1.0


### SQL as structured data

In [6]:
# Load a SQLite table into a dataframe
import sqlite3
import pandas as pd

# Open a connection to the database
conn = sqlite3.connect("chinook.db")

# Run the query and load the result into a dataframe
df = pd.read_sql_query("SELECT * FROM albums", conn)

# Close the connection now that the data is in memory
conn.close()

# Get the first 5 rows
df.head()


Unnamed: 0,AlbumId,Title,ArtistId
0,1,For Those About To Rock We Salute You,1
1,2,Balls to the Wall,2
2,3,Restless and Wild,2
3,4,Let There Be Rock,1
4,5,Big Ones,3


In [3]:
df.shape

(347, 3)

In [7]:
df.describe()

Unnamed: 0,AlbumId,ArtistId
count,347.0,347.0
mean,174.0,121.942363
std,100.314505,77.793131
min,1.0,1.0
25%,87.5,58.0
50%,174.0,112.0
75%,260.5,179.5
max,347.0,275.0


In [8]:
df.columns

Index(['AlbumId', 'Title', 'ArtistId'], dtype='object')
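
The same kind of query can also be run with the standard `sqlite3` module alone. The sketch below uses an in-memory database with a made-up two-row `albums` table, so `chinook.db` is not required; the album titles are placeholders, not real rows from that database.

```python
import sqlite3

# An in-memory database with a made-up table; chinook.db is not needed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE albums (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER)"
)
conn.executemany(
    "INSERT INTO albums (Title, ArtistId) VALUES (?, ?)",
    [("Example Album A", 1), ("Example Album B", 2)],
)

# List the tables in the database, much like listing the sheets of a workbook.
tables = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]
print(tables)

# Use a ? placeholder instead of string formatting when a value varies.
cursor = conn.execute("SELECT AlbumId, Title FROM albums WHERE ArtistId = ?", (2,))

# cursor.description holds one entry per result column; the name comes first.
column_names = [description[0] for description in cursor.description]
rows = cursor.fetchall()
print(column_names, rows)

conn.close()
```

The column names and the list of row tuples together carry the same information as the dataframe shown above: a fixed set of columns and one record per row.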

In [11]:
import openpyxl


def get_number_of_sheets(file_path):
    """Return the number of sheets in an Excel file."""
    # Load the Excel file
    workbook = openpyxl.load_workbook(file_path)

    # Count the sheet names
    return len(workbook.sheetnames)


path = "Bank_Personal_Loan_Modelling.xlsx"
n_sheets = get_number_of_sheets(path)

print("The file", path, "has:", n_sheets, "sheets")


The file Bank_Personal_Loan_Modelling.xlsx has: 2 sheets



In [3]:
# Get a list of numbers from 1 to 10
numbers = list(range(1, 11))

# Print the numbers
print(numbers)

# Loop over the list of numbers and print each number
for number in numbers:
    print(number)
    

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
1
2
3
4
5
6
7
8
9
10
