# 1. Getting Started: Airbnb Copenhagen

This assignment deals with the most recent Airbnb listings in Copenhagen. The data is collected from [Inside Airbnb](http://insideairbnb.com/copenhagen). Feel free to explore the website further in order to better understand the data. The data (*listings.csv*) has been collected as raw data and needs to be preprocessed.

**Hand-in:** Hand in as a group in Itslearning in a **single**, well-organized and easy-to-read Jupyter Notebook. Please just use this notebook to complete the assignment.

If your group consists of students from different classes, upload in **both** classes.

The first cell does some preprocessing. Please run it without changing anything; the assignment starts below. Make sure that *listings.csv* is in the same folder as this notebook.




In [None]:
# pip install pandas
# pip install scikit-learn
import pandas as pd
import sklearn as sk

# load the data
data = pd.read_csv('listings.csv')

# filter relevant columns
data_limited = data[["id",
    "name",
    "host_id"  ,
    "host_name" , 
    "neighbourhood_cleansed"  ,
    "latitude"  ,
    "longitude"  ,
    "room_type"  ,
    "price"  ,
    "minimum_nights"  ,
    "number_of_reviews",  
    "last_review"  ,
    "review_scores_rating"  ,
    "review_scores_accuracy" , 
    "review_scores_cleanliness"  ,
    "review_scores_checkin"  ,
    "review_scores_communication"  ,
    "review_scores_location"  ,
    "review_scores_value"  ,
    "reviews_per_month"  ,
    "calculated_host_listings_count"  ,
    "availability_365",]]

# removing rows with no reviews

data_filtered = data_limited.loc[data_limited['number_of_reviews'] != 0]

# remove nan

data_filtered = data_filtered.dropna()
data_filtered.head()

# get a list of distinct values from neighbourhood_cleansed columns in data_filtered

neighbourhoods = data_filtered["neighbourhood_cleansed"].unique()

# restore Danish letters that were dropped in the source data, e.g. replace Nrrebro with Nørrebro

data_filtered["neighbourhood_cleansed"] = data_filtered["neighbourhood_cleansed"].replace("Nrrebro", "Nørrebro")
data_filtered["neighbourhood_cleansed"] = data_filtered["neighbourhood_cleansed"].replace("sterbro", "Østerbro")
data_filtered["neighbourhood_cleansed"] = data_filtered["neighbourhood_cleansed"].replace("Vanlse", "Vanløse")
data_filtered["neighbourhood_cleansed"] = data_filtered["neighbourhood_cleansed"].replace("Brnshj-Husum", "Brønshøj-Husum")
neighbourhoods = data_filtered["neighbourhood_cleansed"].unique()

# Remove dollar signs and commas and convert to float - note the prices are actually in DKK
data_filtered['price'] = data_filtered['price'].replace(r'[\$,]', '', regex=True).astype(float)

# Calculate the median price
median_price = data_filtered['price'].median()

# Create a new column 'price_category' with 0 for 'affordable' and 1 for 'expensive'
data_filtered['price_category'] = (data_filtered['price'] > median_price).astype(int)

display(data_filtered.head())

# Describe the apartments using a wordcloud
# Remember to install packages
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Stop words: place names, generic listing terms and common Danish/English filler words
stop_words = ['Østerbro', 'Copenhagen', 'København', 'in', 'bedroom', 'bedrooms', 'bed', 'beds', 'bath', 'baths', 'Frederiksberg', 'V', 'Ø', 'SV', 'S', 'N', 'K', 'C', 'W', 'kbh', 'Ballerup', 'Hellerup', 'Valby', 'Vanløse', 'Brønshøj', 'Nørrebro', 'Vesterbro', "CPH", "with", "to", "of", "a", "the", "på", "i", "med", "af", "at", "city", "by", "apartment", "appartment", "lejlighed", "flat", "m2", "apt"]

# Convert the 'name' column to a single string
text = ' '.join(data_filtered['name'].astype(str))

# Create and generate a word cloud image
wordcloud = WordCloud(stopwords=stop_words, background_color="white", width=800, height=400).generate(text)

# Display the generated word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

## Your tasks start here


### 1. Since data science is so much fun, provide a word cloud of the names of the hosts, removing any names of non-persons. Does this more or less correspond with the distribution of names according to [Danmarks Statistik](https://www.dst.dk/da/Statistik/emner/borgere/navne/navne-i-hele-befolkningen)?

In [None]:
# State your solution here. Add more cells if needed.

host_names = data_filtered['host_name'].dropna().astype(str)
business_words = ['airbnb', 'rental', 'apartment', 'home', 'house', 'property', 'stay', 'hosting', 
                  'management', 'company', 'group', 'team', 'service', 'copenhagen', 'cph']

def is_person_name(name):
    name_lower = name.lower()
    # Remove if contains numbers
    if any(char.isdigit() for char in name):
        return False
    # Remove if contains business-related words
    if any(word in name_lower for word in business_words):
        return False
    # Remove if too long (likely business names)
    if len(name) > 30:
        return False
    return True

# Filter for person names only
person_names = [name for name in host_names if is_person_name(name)]

text = ' '.join(person_names)

name_stopwords = ['and', 'og', '&', 'the', 'de', 'van', 'von', 'el', 'la']

names_wordcloud = WordCloud(stopwords=name_stopwords, background_color="white",
                            width=800, height=400, max_words=100).generate(text)

# Display the generated word cloud
plt.figure(figsize=(10, 5))
plt.imshow(names_wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
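
A word cloud only shows relative frequency visually, so comparing against the Danmarks Statistik list is easier with explicit counts. The following is a minimal sketch of one way to do that, assuming multi-person entries such as "Anne & Lars" should count each name separately. The `sample_names` list and the `joiners` set are hypothetical stand-ins; in the notebook the input would be the `person_names` list built above.

```python
from collections import Counter

# Hypothetical sample of host names; in the notebook this would be `person_names`
sample_names = ["Anne", "Peter", "Anne & Lars", "Lars", "Anne", "Mette og Peter"]

# Words that join two names and are not names themselves (assumed list)
joiners = {"og", "and", "&"}

# Split multi-person entries into single names and count each one
name_counts = Counter(
    part
    for name in sample_names
    for part in name.split()
    if part.lower() not in joiners
)

# Print the most frequent names with their counts
for name, count in name_counts.most_common(3):
    print(f"{name}: {count}")
```

The resulting ranking can then be compared by eye with the most common first names published by Danmarks Statistik.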

### 2. Using non-scaled versions of latitude and longitude, plot the listings data on a map.

In [None]:
# State your solution here. Add more cells if needed.
import contextily as ctx
# Drop the most expensive 5% of listings so that outliers do not wash out the colour scale
price_cap = data_filtered['price'].quantile(0.95)
data_plot = data_filtered[data_filtered['price'] < price_cap]

fig, ax = plt.subplots(figsize=(15, 12))
scatter = ax.scatter(data_plot['longitude'], data_plot['latitude'],
                     c=data_plot['price'], cmap='coolwarm', alpha=0.6, s=10)

ax.set_xlim(data_plot['longitude'].min() - 0.01, data_plot['longitude'].max() + 0.01)
ax.set_ylim(data_plot['latitude'].min() - 0.01, data_plot['latitude'].max() + 0.01)


ctx.add_basemap(ax, crs='EPSG:4326', source=ctx.providers.CartoDB.Positron)

cbar = plt.colorbar(scatter, ax=ax, shrink=0.6)
cbar.set_label('Price (DKK)', rotation=270, labelpad=15)

ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_title('Copenhagen Airbnb Listings Colored by Price')
plt.show()

### 3. Create boxplots where you have the neighbourhood on the x-axis and price on the y-axis. What does this tell you about the listings in Copenhagen? Keep the x-axis as is and move different variables into the y-axis to see how things are distributed between the neighborhoods to create different plots (your choice).

In [None]:
# Create a list of distinct neighbourhoods
neighbourhoods = data_filtered['neighbourhood_cleansed'].unique()
neighbourhoods_list = list(neighbourhoods)

# Return one list of values per neighbourhood for the given column,
# in the same order as neighbourhoods_list
def create_neighbourhood_lists(column):
    values_per_neighbourhood = []
    for neighbourhood in neighbourhoods:
        data_points = data_filtered.loc[data_filtered['neighbourhood_cleansed'] == neighbourhood, column].tolist()
        values_per_neighbourhood.append(data_points)
    return values_per_neighbourhood

In [None]:
# Using the function, create box plot for prices
neighbourhoods_prices = create_neighbourhood_lists('price')
plt.boxplot(neighbourhoods_prices, showfliers=False)
plt.title('Price Distribution by Neighbourhood')
plt.xlabel('Neighbourhood')
plt.xticks(ticks=range(1, len(neighbourhoods_list) + 1), labels=neighbourhoods_list, rotation=45, ha='right')
plt.ylabel('Price (DKK)')
plt.show()

In [None]:
neighbourhoods_min_nights = create_neighbourhood_lists('minimum_nights')
plt.boxplot(neighbourhoods_min_nights, showfliers=False)
plt.title('Minimum Nights Distribution by Neighbourhood')
plt.xlabel('Neighbourhood')
plt.xticks(ticks=range(1, len(neighbourhoods_list) + 1), labels=neighbourhoods_list, rotation=45, ha='right')
plt.ylabel('Minimum Nights')
plt.show()

In [None]:
# Create an overall review column, with a value of mean reviews

review_columns = [
    "review_scores_rating",
    "review_scores_accuracy",
    "review_scores_cleanliness",
    "review_scores_checkin",
    "review_scores_communication",
    "review_scores_location",
    "review_scores_value"
]

data_filtered["review_overall"] = data_filtered[review_columns].mean(axis=1)

In [None]:
neighbourhoods_reviews = create_neighbourhood_lists('review_overall')

plt.boxplot(neighbourhoods_reviews, showfliers=False)
plt.title('Review Score Distribution by Neighbourhood')
plt.xlabel('Neighbourhood')
plt.xticks(ticks=range(1, len(neighbourhoods_list) + 1), labels=neighbourhoods_list, rotation=45, ha='right')
plt.ylabel('Mean review score')
plt.show()

### 4. Do a descriptive analysis of the neighborhoods. Include information about room type in the analysis as well as one other self-chosen feature. The descriptive analysis should contain mean/average, mode, median, standard deviation/variance, minimum, maximum and quartiles.

In [None]:
print("\n--- Counts of Room Types by Neighbourhood ---")
room_type_counts = pd.crosstab(data_filtered['neighbourhood_cleansed'], data_filtered['room_type'])
display(room_type_counts)

In [None]:
review_score_stats = data_filtered.groupby('neighbourhood_cleansed')['review_overall'].describe()
display(review_score_stats)


In [None]:
print("\n--- Custom Descriptive Statistics for Price by Neighbourhood ---")
custom_price_stats = data_filtered.groupby('neighbourhood_cleansed')['price'].agg(
    min_price='min',
    mean_price='mean',
    median_price='median',
    max_price='max',
    std_deviation='std',
    variance='var',
    count='count'
)

display(custom_price_stats.round(2))
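
The task also asks for the mode and the quartiles, which the aggregation above does not include. The following is a minimal sketch of how they could be computed per neighbourhood with pandas. The `toy` DataFrame is hypothetical data standing in for `data_filtered`, and taking the smallest value when a group has several modes is an assumption made here to get a single number per group.

```python
import pandas as pd

# Hypothetical toy data standing in for `data_filtered`
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["Valby"] * 4 + ["Amager Vest"] * 3,
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room", "Entire home/apt",
                  "Private room", "Private room", "Entire home/apt"],
    "price": [800.0, 900.0, 900.0, 1200.0, 1000.0, 1000.0, 1500.0],
})

grouped = toy.groupby("neighbourhood_cleansed")

# Quartiles: quantile() with a list returns a MultiIndex Series, so unstack it into columns
quartiles = grouped["price"].quantile([0.25, 0.5, 0.75]).unstack()
quartiles.columns = ["q1", "median", "q3"]

# Mode: a group can have several modes; keep the first (smallest) as a single value
price_mode = grouped["price"].agg(lambda s: s.mode().iloc[0]).rename("price_mode")
room_type_mode = grouped["room_type"].agg(lambda s: s.mode().iloc[0]).rename("room_type_mode")

extra_stats = quartiles.join([price_mode, room_type_mode])
print(extra_stats)
```

Applied to the real data, the same pattern would use `data_filtered` in place of `toy`, and the result could be joined onto `custom_price_stats`.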

### 5. Based on self-chosen features, and with "price_category" as your target, develop a k-Nearest Neighbor model to determine whether a rental property should be classified as 0 or 1. Remember to divide your data into training data and test data. Comment on your findings.

In [None]:
ml_data = data_filtered[[
    "latitude"  ,
    "longitude"  ,
    "minimum_nights"  ,
    "review_overall",
    "calculated_host_listings_count"  ,
    "availability_365",
    "price_category"]]

one_hot_neighbourhood = pd.get_dummies(data_filtered['neighbourhood_cleansed'], dtype=float)
one_hot_room_type = pd.get_dummies(data_filtered['room_type'], dtype=float)

ml_data = pd.concat([ml_data, one_hot_room_type, one_hot_neighbourhood], axis=1)

ml_data.head()

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

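y = ml_data['price_category']
X = ml_data.drop(columns=['price_category'])

# Split first so that the scaler never sees the test rows (avoids data leakage)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training data only, then apply the same scaling to the test data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)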

In [None]:
model = KNeighborsClassifier(n_neighbors=7)
model.fit(X_train, y_train)

print("Score on training data:", model.score(X_train, y_train))
print("Score on test data:", model.score(X_test, y_test))
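
The value `n_neighbors=7` is one possible choice. A common way to compare several values of k is cross-validation on the training data. The following is a minimal sketch of that idea using a scikit-learn pipeline, so that scaling is refit inside every fold. It runs on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, `cv_scores` and `best_k`, as well as the candidate values of k, are hypothetical and not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and the price_category target
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

cv_scores = {}
for k in (3, 5, 7, 9, 11):
    # The pipeline refits the scaler in each fold, so validation rows never influence the scaling
    pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(pipeline, X_demo, y_demo, cv=5).mean()

# Pick the k with the highest mean cross-validated accuracy
best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores, "best k:", best_k)
```

In the notebook, the same loop could be run on the unscaled training split, keeping the test set aside for a single final evaluation. A large gap between training and test accuracy would suggest overfitting, which a larger k tends to reduce.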