# Data Acquisition

### Imports

In [1]:
import pandas as pd
import pydataset
import acquire

### Exercises

1. Use a Python module containing datasets (pydataset or seaborn) as the source for the iris data. Create a pandas dataframe, df_iris, from this data.

    - print the first 3 rows
    - print the number of rows and columns (shape)
    - print the column names
    - print the data type of each column
    - print the summary statistics for each of the numeric variables. Would you recommend rescaling the data based on these statistics?

In [2]:
df_iris = pydataset.data("iris")

In [3]:
df_iris.head(3)

|   | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |


In [4]:
df_iris.shape

(150, 5)

In [5]:
df_iris.columns

Index(['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width',
       'Species'],
      dtype='object')

In [6]:
df_iris.dtypes

Sepal.Length    float64
Sepal.Width     float64
Petal.Length    float64
Petal.Width     float64
Species          object
dtype: object

In [7]:
df_iris.describe()

|   | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |
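
One way to approach the rescaling question is to compare the spread of each numeric column. The sketch below re-enters the `min`, `max`, and `std` rows from the `describe()` output above by hand, so it runs without `pydataset`, and computes each column's range. The names `summary` and `value_range` are illustrative and not part of the original notebook.

```python
import pandas as pd

# min, max, and std copied by hand from the describe() output above
summary = pd.DataFrame(
    {
        "Sepal.Length": [4.3, 7.9, 0.828066],
        "Sepal.Width": [2.0, 4.4, 0.435866],
        "Petal.Length": [1.0, 6.9, 1.765298],
        "Petal.Width": [0.1, 2.5, 0.762238],
    },
    index=["min", "max", "std"],
)

# Range (max - min) for each column
value_range = summary.loc["max"] - summary.loc["min"]
print(value_range.round(1))
```

All four columns are measured in the same unit (centimeters), and their ranges fall between 2.4 and 5.9, so rescaling looks optional here. Whether it helps probably depends on the model: distance-based methods tend to be more sensitive to differences in spread than tree-based ones.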


2. Read the Table1_CustDetails sheet from the Excel module dataset, Excel_Exercises.xlsx, into a dataframe, df_excel

    - assign the first 100 rows to a new dataframe, df_excel_sample
    - print the number of rows of your original dataframe
    - print the first 5 column names
    - print the column names that have a data type of object
    - compute the range for each of the numeric variables.

In [8]:
xls = pd.ExcelFile("Spreadsheets_Exercises.xlsx")
df_excel = pd.read_excel(xls, "Table1_CustDetails")

In [9]:
df_excel_sample = df_excel.head(100)

In [10]:
df_excel_sample.shape

(100, 12)

In [11]:
df_excel.shape[0]

7049

In [12]:
df_excel.columns[:5]

Index(['customer_id', 'gender', 'is_senior_citizen', 'partner', 'dependents'], dtype='object')

In [13]:
df_excel.select_dtypes(include='object').columns

Index(['customer_id', 'gender', 'partner', 'dependents', 'payment_type',
       'churn'],
      dtype='object')

In [14]:
int_max = df_excel.select_dtypes(include='int64').max()
int_min = df_excel.select_dtypes(include='int64').min()
int_max - int_min

is_senior_citizen    1
phone_service        2
internet_service     2
contract_type        2
dtype: int64

In [15]:
float_max = df_excel.select_dtypes(include='float64').max()
float_min = df_excel.select_dtypes(include='float64').min()
float_max - float_min

monthly_charges     100.5
total_charges      8666.0
dtype: float64
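
The two cells above compute the range separately for `int64` and `float64` columns. As a possible alternative, `select_dtypes(include="number")` picks up both in one pass. The sketch below uses a small hypothetical stand-in, `df_demo`, rather than the real spreadsheet, so its values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df_excel: one text column, one int column, one float column
df_demo = pd.DataFrame(
    {
        "customer_id": ["a", "b", "c"],
        "is_senior_citizen": [0, 1, 0],
        "monthly_charges": [20.5, 70.0, 118.75],
    }
)

# Keep every numeric column regardless of whether it is int or float
numeric = df_demo.select_dtypes(include="number")

# Range (max - min) for each numeric column
numeric_range = numeric.max() - numeric.min()
print(numeric_range)
```

Because integer and float columns are combined into a single result, the ranges come back as floats; the separate `int64` cell above keeps integer ranges as integers.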

3. Read the data from this Google Sheet into a dataframe, df_google

    - print the first 3 rows
    - print the number of rows and columns
    - print the column names
    - print the data type of each column
    - print the summary statistics for each of the numeric variables
    - print the unique values for each of your categorical variables

In [16]:
sheet_url = "https://docs.google.com/spreadsheets/d/1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/edit#gid=341089357"
csv_export_url = sheet_url.replace('/edit#gid=', '/export?format=csv&gid=')
df_google = pd.read_csv(csv_export_url)
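
To make the URL rewrite in the cell above easier to see, here is a minimal sketch that wraps the same `str.replace` call in a helper and prints the result without touching the network. The helper name `to_csv_export_url` is an assumption introduced for illustration.

```python
def to_csv_export_url(sheet_url: str) -> str:
    """Turn a Google Sheets edit URL into its CSV export URL (hypothetical helper)."""
    return sheet_url.replace("/edit#gid=", "/export?format=csv&gid=")


sheet_url = "https://docs.google.com/spreadsheets/d/1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/edit#gid=341089357"
print(to_csv_export_url(sheet_url))
# https://docs.google.com/spreadsheets/d/1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/export?format=csv&gid=341089357
```

If the input does not contain `/edit#gid=`, `str.replace` returns the string unchanged, so it may be worth checking the URL shape before passing it to `pd.read_csv`.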

In [17]:
df_google.head(3)

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |


In [18]:
df_google.shape

(891, 12)

In [19]:
df_google.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [20]:
df_google.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [21]:
df_google.describe()

|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [22]:
df_google.select_dtypes(include='object').columns

Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')

In [23]:
df_google.select_dtypes(include='object').nunique()

Name        891
Sex           2
Ticket      681
Cabin       147
Embarked      3
dtype: int64
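
The `nunique()` call above reports how many distinct values each object column has, while the exercise asks for the values themselves. A minimal sketch of one way to list them follows, skipping high-cardinality columns such as `Name`. It uses a small stand-in built from the three rows shown by `df_google.head(3)`; the names `df_demo` and `max_unique` and the cutoff of 2 are assumptions chosen for illustration.

```python
import pandas as pd

# Stand-in built from the three rows displayed by df_google.head(3);
# dtype=object mirrors how the notebook's text columns are stored
df_demo = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
            "Heikkinen, Miss. Laina",
        ],
        "Sex": ["male", "female", "female"],
        "Embarked": ["S", "C", "S"],
    },
    dtype=object,
)

max_unique = 2  # hypothetical cutoff; treat columns above it as free text, not categories

object_cols = df_demo.select_dtypes(include="object")

# Collect the distinct non-null values of each low-cardinality column
unique_values = {
    col: object_cols[col].dropna().unique().tolist()
    for col in object_cols.columns
    if object_cols[col].nunique() <= max_unique
}
print(unique_values)
```

On the full dataset, a larger cutoff (for example, anything below the 147 distinct `Cabin` values reported above) would keep `Sex` and `Embarked` while leaving out `Name`, `Ticket`, and `Cabin`.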

In [24]:
# get_titanic_data

In [25]:
sheet_url = "https://docs.google.com/spreadsheets/d/1PmmRUXgmQ6oO9fLORG4oeMLe_mzIWlFAkJKA2cwLOLg/edit#gid=935554057"

In [26]:
csv_export_url = sheet_url.replace('/edit#gid=', '/export?format=csv&gid=')

In [27]:
df_titanic = pd.read_csv(csv_export_url)

In [28]:
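def get_titanic_data():
    """Read the titanic Google Sheet and return it as a pandas DataFrame."""
    sheet_url = "https://docs.google.com/spreadsheets/d/1PmmRUXgmQ6oO9fLORG4oeMLe_mzIWlFAkJKA2cwLOLg/edit#gid=935554057"
    # Swap the edit URL for the CSV export endpoint so pandas can read it directly
    csv_export_url = sheet_url.replace('/edit#gid=', '/export?format=csv&gid=')
    df_titanic = pd.read_csv(csv_export_url)
    return df_titanic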