# **Assignment 7**


# Introduction

In this section, please describe the dataset you are using. Include a link to the source of this data. You should also explain why you chose this dataset.

### Describing Dataset

### Data Source Link

### Justification for Dataset Selection


# Data Exploration (EDA)
Import your dataset into your .ipynb, create dataframes, and explore your data.  

Include: 

* Summary statistics (means, medians, quartiles)
* Missing value information
* Any other relevant information about the dataset.  



### Importing Libraries

#### The first step is to import the pandas, NumPy, and Matplotlib libraries for data exploration, cleaning, manipulation, and plotting.

In [None]:
import numpy as np
import pandas as pd
import datetime
#import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import warnings
warnings.filterwarnings('ignore')

In [None]:
csv_path = 'resources/austin_tx_airbnb.csv'
airbnb_df = pd.read_csv(csv_path, encoding="utf-8")
airbnb_df.head(5)

In [None]:
airbnb_df.tail(5)

In [None]:
airbnb_df.info()

In [None]:
airbnb_df.shape

In [None]:
airbnb_df.dtypes

In [None]:
for column in airbnb_df.columns:
    print(f"Column {column} has {airbnb_df[column].isnull().sum()} null values")

In [None]:
# Summarise missing values per column: count and percentage of rows
def missing_values_table(df):
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * mis_val / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table.columns = ['Missing Values', '% of Total Values']
    # Keep only columns with at least one missing value, largest share first
    mis_val_table = (
        mis_val_table[mis_val_table['% of Total Values'] != 0]
        .sort_values('% of Total Values', ascending=False)
        .round(1)
    )
    print(f"Your selected dataframe has {df.shape[1]} columns.\n"
          f"There are {mis_val_table.shape[0]} columns that have missing values.")
    return mis_val_table

In [None]:
missing_values_table(airbnb_df)
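
For a quick check, the same information can be sketched more compactly: `isna().mean()` gives the fraction of missing values per column directly. The `toy` frame below is made-up illustration data, not taken from the Airbnb file.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not taken from the Airbnb file.
toy = pd.DataFrame({
    "host_name": ["Ann", None, "Raj", "Lee"],
    "reviews_per_month": [1.2, np.nan, np.nan, 0.4],
    "price": [100, 85, 230, 60],
})

# Share of missing values per column, as a percentage.
missing_pct = (toy.isna().mean() * 100).round(1)

# Keep only the columns that actually have gaps, largest share first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```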

In [None]:
airbnb_df['last_review'].value_counts()

In [None]:
airbnb_df['last_review'].values.tolist()

In [None]:
airbnb_df['reviews_per_month'].value_counts()

In [None]:
airbnb_df['host_name'].value_counts()

In [None]:
for c in airbnb_df.columns:
    print("---- %s ---" % c)
    print(airbnb_df[c].value_counts())

# Data Wrangling
Create a subset of your original data and perform the following.  

1. Modify multiple column names.

2. Look at the structure of your data. Are any variables improperly coded, such as numbers or dates stored as strings? Convert them to the correct type if needed.

3. Fix missing and invalid values in data.

4. Create new columns based on existing columns or calculations.

5. Drop column(s) from your dataset.

6. Drop a row(s) from your dataset.

7. Sort your data based on multiple variables. 

8. Filter your data based on some condition. 

9. Convert all the string values to upper or lower cases in one column.

10. Check whether numeric values are present in a given column of your dataframe.

11. Group your dataset by one column, and get the mean, min, and max values by group. 
  * groupby()
  * .agg() or .apply()

12. Group your dataset by two columns and then sort the aggregated results within the groups. 

**You are free (and encouraged) to add to these questions. Please clearly indicate your answers to these questions in your assignment.**

### Create a subset of your original data and perform the following:

In [None]:
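# Take the first 18,330 rows as the working subset.
# .copy() makes new_df independent of airbnb_df, so later edits do not modify
# (or warn about modifying) a slice of the original dataframe.
new_df = airbnb_df.iloc[0:18330].copy()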

In [None]:
new_df.shape
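
As a minimal sketch of why a subset is usually taken with `.copy()`, the made-up `toy` frame below (not the Airbnb data) shows that edits to a copied subset leave the original untouched.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"id": [1, 2, 3, 4], "price": [100, 85, 230, 60]})

# .copy() gives the subset its own data, so edits to it never reach the original.
subset = toy.iloc[0:2].copy()
subset["price"] = subset["price"] * 2

print(toy["price"].tolist())
print(subset["price"].tolist())
```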

### 1- Modify multiple column names.

In [None]:
new_df.columns

In [None]:
new_df = new_df.rename(columns={
    'name': 'specifications',
    'calculated_host_listings_count': 'listings_count',
    'availability_365': 'availability_days',
})

In [None]:
new_df.columns

### 2- Look at the structure of your data. Are any variables improperly coded, such as numbers or dates stored as strings? Convert them to the correct type if needed.

In [None]:
new_df['last_review'].values.tolist()

In [None]:
new_df["last_review"] = pd.to_datetime(new_df["last_review"])

In [None]:
new_df.head()

In [None]:
new_df.info()
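
If a date column might contain entries that cannot be parsed, one option is `errors="coerce"`, which turns them into `NaT` instead of raising. This is a sketch on a made-up series; whether the Airbnb `last_review` column contains such entries is not known here.

```python
import pandas as pd

# Toy values for illustration: two valid dates, one bad string, one missing.
raw = pd.Series(["2019-03-14", "2020-11-02", "not a date", None])

# errors="coerce" converts anything unparseable to NaT rather than failing.
parsed = pd.to_datetime(raw, errors="coerce")

print(parsed.dtype)
print(parsed.isna().sum())
```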

### 3- Fix missing and invalid values in data.

In [None]:
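# Keep last_review as a proper datetime column.
# Listings without reviews have no last_review date. '0000-00-00' is not a valid
# date and would mix strings with dates in one column, so missing dates stay as NaT.
new_df['last_review'] = pd.to_datetime(new_df['last_review'])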

In [None]:
new_df[['host_name']] = new_df[['host_name']].fillna('unknown')


In [None]:
new_df.info()

In [None]:
new_df['reviews_per_month'] = new_df['reviews_per_month'].fillna(0)

In [None]:
new_df['last_review'].isnull().sum()

In [None]:
new_df.info()
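
Several columns can also be filled in one call by passing a dictionary to `fillna`; columns not listed are left alone. A minimal sketch on made-up data (the `toy` frame and `fill_values` name are illustrative):

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "host_name": ["Ann", None, "Raj"],
    "reviews_per_month": [1.2, np.nan, 0.4],
    "last_review": pd.to_datetime(["2020-01-05", None, "2019-07-19"]),
})

# One fill value per column; last_review is not listed, so its NaT is kept.
fill_values = {"host_name": "unknown", "reviews_per_month": 0}
filled = toy.fillna(value=fill_values)

print(filled)
```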

### 4- Create new columns based on existing columns or calculations.

In [None]:
new_df['monthly_reviews'] = new_df['reviews_per_month'].apply(np.ceil)


In [None]:
# Values are already whole numbers after np.ceil, so only the dtype needs changing
new_df['monthly_reviews'] = new_df['monthly_reviews'].astype('int64')

In [None]:
new_df['availability_status'] = ["not_available" if x == 0 else "available" for x in new_df['availability_days']]

In [None]:
new_df.info()
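
A status column like this can also be built without a Python loop using `np.where`, which evaluates the condition on the whole column at once. A sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"availability_days": [0, 12, 365, 0]})

# np.where(condition, value_if_true, value_if_false), applied element-wise.
toy["availability_status"] = np.where(
    toy["availability_days"] == 0, "not_available", "available"
)

print(toy)
```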

### 5- Drop column(s) from your dataset.

In [None]:
new_df = new_df.drop(["neighbourhood_group", "license"], axis=1)

In [None]:
new_df.columns

### 6- Drop a row(s) from your dataset.

In [None]:
# drop() returns a new dataframe, so assign the result to actually remove the rows
new_df = new_df.drop([0, 1])

In [None]:
new_df.shape
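
`DataFrame.drop` returns a new dataframe rather than changing the one it is called on, so the row count only changes when the result is assigned. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"price": [100, 85, 230, 60]})

toy.drop([0, 1])              # result is discarded; toy is unchanged
unchanged_rows = len(toy)

trimmed = toy.drop([0, 1])    # assigning the result keeps the change

print(unchanged_rows, len(trimmed))
```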

### 7- Sort your data based on multiple variables.

In [None]:
new_df = new_df.sort_values(['id', 'last_review'],
              ascending = [True, True]) 

In [None]:
new_df.head()

### 8- Filter your data based on some condition.

In [None]:
# Exact match on room type; .copy() so later edits to filter_df do not touch new_df
filter_df = new_df[new_df["room_type"] == "Entire home/apt"].copy()
filter_df.head()

### 9- Convert all the string values to upper or lower cases in one column.

In [None]:
filter_df['room_type'] = filter_df['room_type'].str.upper()
filter_df.head()

### 10- Check whether numeric values are present in a given column of your dataframe.

In [None]:
# Boolean column: True where the listing name consists only of digits
# (astype(str) keeps missing names from raising an error)
filter_df['specifications_numeric'] = filter_df['specifications'].astype(str).str.isdigit()
filter_df.head()
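
"Numeric values are present" can be read in two ways: the text contains a digit somewhere, or the whole value is a number. The sketch below shows one possible check for each reading on made-up listing names.

```python
import pandas as pd

# Toy listing names for illustration only.
names = pd.Series(["Cozy 2BR near downtown", "12345", "Lake house", None])

# Reading 1: the text contains at least one digit (missing values count as False).
has_digit = names.str.contains(r"\d", na=False)

# Reading 2: the whole value can be parsed as a number.
is_number = pd.to_numeric(names, errors="coerce").notna()

print(pd.DataFrame({"name": names, "has_digit": has_digit, "is_number": is_number}))
```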

### 11- Group your dataset by one column, and get the mean, min, and max values by group.

In [None]:
result_df = new_df.groupby('room_type').agg({'price': ['mean', 'min', 'max']})
  
print("Mean, min, and max values of room price are:")
print(result_df)
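
The dictionary form of `agg` produces two-level column names such as `('price', 'mean')`. Named aggregation is an alternative that yields flat, readable column names. A sketch on made-up data (the output column names are illustrative choices):

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt", "Private room"],
    "price": [200, 50, 100, 70],
})

# Each keyword is an output column: new_name=(source_column, aggregation).
summary = toy.groupby("room_type").agg(
    mean_price=("price", "mean"),
    min_price=("price", "min"),
    max_price=("price", "max"),
)

print(summary)
```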

### 12- Group your dataset by two columns and then sort the aggregated results within the groups.

In [None]:
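# Average price per host within each room type,
# sorted from highest to lowest price inside each room type
df = (
    new_df.groupby(['room_type', 'host_name'])['price']
    .mean()
    .reset_index()
    .sort_values(['room_type', 'price'], ascending=[True, False])
)
df.head(10)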
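
Once results are sorted within groups, the top entry of each group can be picked with `groupby(...).head(1)`. The sketch below uses made-up data to show the full pattern: group by two columns, aggregate, sort within the first column, then keep the first row per group.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 3 + ["Private room"] * 3,
    "host_name": ["Ann", "Raj", "Ann", "Lee", "Raj", "Lee"],
    "price": [200, 120, 100, 40, 70, 60],
})

# Mean price per (room_type, host_name), highest price first within each room type.
ranked = (
    toy.groupby(["room_type", "host_name"], as_index=False)["price"]
    .mean()
    .sort_values(["room_type", "price"], ascending=[True, False])
)

# Because rows are already sorted, head(1) keeps the most expensive host per room type.
top_per_type = ranked.groupby("room_type").head(1)

print(ranked)
print(top_per_type)
```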

In [None]:
missing_values_table(new_df)

### Summary statistics: means, medians, quartiles

In [None]:
new_df.describe()
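
In the `describe()` output, the `50%` row is the median and the `25%` and `75%` rows are the first and third quartiles. They can also be computed directly, which makes it easy to compare the mean with the median when a column has outliers. A sketch on made-up prices:

```python
import pandas as pd

# Toy prices for illustration only; the last value is a deliberate outlier.
prices = pd.Series([60, 85, 100, 230, 1000])

mean_price = prices.mean()
median_price = prices.median()
quartiles = prices.quantile([0.25, 0.5, 0.75])

# The outlier pulls the mean well above the median.
print(mean_price, median_price)
print(quartiles)
```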

In [None]:
# Show remaining missing values, then drop any row that still has one
print(new_df.isnull().sum())
new_df = new_df.dropna(how='any')
new_df.info()

In [None]:
# Report and remove fully duplicated rows
print(f"Duplicate rows: {new_df.duplicated().sum()}")
new_df = new_df.drop_duplicates()
new_df

In [None]:
# Generate a scatter plot of price by room type
# (columns are selected by name rather than by position)
plt.scatter(new_df['room_type'], new_df['price'], alpha=0.3)
plt.xlabel('Room type')
plt.ylabel('Price')
plt.show()

In [None]:
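# plotting a bar graph of the average price per room type
# (plotting new_df directly would draw one bar per listing)
new_df.groupby("room_type")["price"].mean().plot(kind="bar")
plt.ylabel("Average price")
plt.show()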


In [None]:
# plotting a pie chart of the average price per room type
avg_price = new_df.groupby("room_type")["price"].mean()
plt.pie(avg_price, labels=avg_price.index)
plt.show()

# Conclusions  

After exploring your dataset, provide a short summary of what you noticed from this dataset.  What would you explore further with more time?