![](https://images.unsplash.com/photo-1533854775446-95c4609da544?ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&ixlib=rb-1.2.1&auto=format&fit=crop&w=1050&q=80)

In the first part of this notebook we'll read a **JSON** file and extract statistics on **US universities**. In the second part we'll **merge** our 'university-statistics' table with a larger table containing **latitude and longitude data** of all American cities, which I previously uploaded to Kaggle from https://simplemaps.com/data/us-cities. In the third part, this geospatial information will let us put the universities on a **map**.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Part 1. Cleaning the Table and Extracting Stats

In [None]:
# read a JSON file
df = pd.read_json("../input/university-statistics/schoolInfo.json")

In [None]:
# show all columns
pd.options.display.max_columns = None

In [None]:
df.head()

In [None]:
df.info()

In [None]:
# drop columns that contain only NaN values
df.dropna(axis=1, how='all', inplace=True)

In [None]:
# drop columns with only one distinct value
for col in df.columns:
    if len(df[col].unique()) == 1:
        df.drop(col,inplace=True,axis=1)
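The two pruning steps above can be tried on a small example first. This is a minimal sketch on a made-up DataFrame: the `toy` frame and its column names are assumptions, not taken from `schoolInfo.json`, and `nunique(dropna=False)` is just one possible way to detect constant columns.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not from schoolInfo.json)
toy = pd.DataFrame({
    "displayName": ["A University", "B College", "C Institute"],
    "allMissing": [np.nan, np.nan, np.nan],   # only NaN values
    "constant": ["x", "x", "x"],              # one distinct value
    "tuition": [100.0, np.nan, 300.0],        # partly missing, still informative
})

# Step 1: drop columns in which every value is NaN
pruned = toy.dropna(axis=1, how="all")

# Step 2: drop columns with a single distinct value (NaN counts as a value here)
constant_cols = [c for c in pruned.columns if pruned[c].nunique(dropna=False) == 1]
pruned = pruned.drop(columns=constant_cols)

print(list(pruned.columns))
```

Collecting the column names first and dropping them in one call avoids modifying the DataFrame while looping over its columns.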

In [None]:
df.head()

In [None]:
# drop irrelevant columns and columns that semantically duplicate other columns in the DataFrame
df.drop(['primaryPhoto', 'primaryPhotoThumb', 'sortName', 'urlName', 'aliasNames', 'nonResponderText', 'nonResponder', 'rankingSortRank', 'overallRank', 'rankingRankStatus', 'xwalkId', 'primaryKey'], axis=1, inplace=True)

In [None]:
df.head()

In [None]:
# print all unique cities with universities
city = df['city'].unique()
print(sorted(city))

In [None]:
# number of unique cities
df['city'].nunique()

In [None]:
df.loc[df['city'] == 'St Louis']

In [None]:
df.loc[df['city'] == 'St. Louis']

In [None]:
# 'St Louis' is missing its period in some rows; replace it with the correct 'St. Louis'
df.replace('St Louis', 'St. Louis', inplace=True)

In [None]:
# we replace Ft. Lauderdale with Fort Lauderdale because
# in the United States Cities Database table it is Fort Lauderdale
# and we are going to merge two tables on city names
df.replace('Ft. Lauderdale', 'Fort Lauderdale', inplace=True)
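The two replacements above act on every column of the DataFrame. As a hedged alternative, the spelling fixes could be kept in one dictionary and applied to the `city` column only. The `sample` frame and the `city_fixes` name below are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample of the 'city' and 'state' columns
sample = pd.DataFrame({
    "city": ["St Louis", "St. Louis", "Ft. Lauderdale", "Chicago"],
    "state": ["MO", "MO", "FL", "IL"],
})

# Spelling fixes collected in one place; extend as more mismatches turn up
city_fixes = {
    "St Louis": "St. Louis",
    "Ft. Lauderdale": "Fort Lauderdale",
}

# Series.replace with a dict swaps exact matches only, and only in this column
sample["city"] = sample["city"].replace(city_fixes)

print(sample["city"].tolist())
```

Restricting the replacement to one column means a matching string in another column, such as a university name, cannot be changed by accident.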

In [None]:
# show which cities have the most universities
df['city'].value_counts().head(10)

**Chicago** and **New York** have the most universities.

In [None]:
# show which states have the most universities
df['state'].value_counts().head(10)

In [None]:
# create a new DataFrame with the number of universities in each state and plot it as a bar chart
import matplotlib.pyplot as plt

df_count = df['state'].value_counts().rename_axis('State').reset_index(name='Number of Universities')

plt.rcParams["figure.figsize"] = (18, 5)

df_count.plot.bar(x='State', rot=45)

Unsurprisingly, **California** and **Texas**, the two most populous states, also have the most universities.

In [None]:
# move the column with university names to the front so the table is easier to read
df = df[ ['displayName'] + [ col for col in df.columns if col != 'displayName' ] ]

In [None]:
# universities with the highest ranking
df.sort_values(by=['rankingDisplayScore'], ascending=False).head()

**Princeton University** has the highest ranking.

In [None]:
# universities with the highest enrollment
df.sort_values(by=['enrollment'], ascending=False).head()

**University of Central Florida** has the largest enrollment.

In [None]:
# universities with the highest tuition
df.sort_values(by=['tuition'], ascending=False).head()

**Columbia University** is the most expensive one.

In [None]:
# universities with the highest percent of students receiving aid
df.sort_values(by=['percent-receiving-aid'], ascending=False).head()

Interestingly, the top 3 universities by share of students receiving aid are all located in **New York State**.
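The four rankings above all follow the same sort-then-head pattern. A minimal sketch of how that pattern could be wrapped in one function is shown here; `top_n` is a hypothetical helper and the `toy` rows are invented, with only the column names borrowed from the table above.

```python
import pandas as pd

# Hypothetical rows; the column names match the ones used above
toy = pd.DataFrame({
    "displayName": ["A University", "B College", "C Institute", "D School"],
    "enrollment": [5000, 62000, 12000, 800],
    "tuition": [55000, 6000, 48000, 30000],
})

def top_n(frame, column, n=5):
    """Return the n rows with the largest values in `column` (hypothetical helper)."""
    return frame.nlargest(n, column)[["displayName", column]]

# Print the two largest entries for each metric
for metric in ["enrollment", "tuition"]:
    print(top_n(toy, metric, n=2), end="\n\n")
```

`DataFrame.nlargest` gives the same rows as sorting in descending order and taking the head, while showing only the name and the metric keeps the output compact.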

# Part 2. Merging tables

In [None]:
df2 = pd.read_csv("../input/united-states-cities-database/uscities.csv")

In [None]:
# columns of our United States Cities Database DataFrame
df2.columns

In [None]:
df2

In [None]:
# the 'city_ascii' column holds ASCII-only spellings of the city names, which are easier to match on;
# list the rows where the two spellings differ
df2.loc[(df2['city'] != df2['city_ascii'])]

In [None]:
df2.drop(['city'], axis=1, inplace=True)

In [None]:
df2.rename(columns={"city_ascii": "city"}, inplace=True)

In [None]:
# merge the two tables on city name so that cities from the 'university-statistics' table
# get the longitude and latitude data from the US Cities Database table
df_merged = pd.merge(df, df2, on='city')

In [None]:
df_merged

In [None]:
# many cities in different states share a name, which is why the merged table has 1604 rows
# keep only the rows where both the city name and the state match
df_merged = df_merged.loc[(df_merged['state'] == df_merged['state_id'])]

In [None]:
df_merged.info()

The new table has 294 rows, 16 fewer than the original university table, so we lost about 5% of the universities to discrepancies between the two tables. That still leaves enough data for our interactive map, since **longitude and latitude data** was attributed correctly in the vast majority of cases.
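To see which universities drop out, one option is a left join on city and state together with `indicator=True`. The sketch below runs on invented stand-ins for the two tables (`unis` and `cities` are hypothetical names and rows); it is not the merge used in this notebook.

```python
import pandas as pd

# Hypothetical stand-ins for df (universities) and df2 (cities)
unis = pd.DataFrame({
    "displayName": ["A University", "B College", "C Institute"],
    "city": ["Springfield", "Springfield", "Nowhere"],
    "state": ["IL", "MA", "TX"],
})
cities = pd.DataFrame({
    "city": ["Springfield", "Springfield", "Springfield"],
    "state_id": ["IL", "MA", "MO"],
    "lat": [39.8, 42.1, 37.2],
    "lng": [-89.6, -72.5, -93.3],
})

# A left join on both keys keeps every university; indicator=True adds a '_merge' column
checked = unis.merge(
    cities,
    how="left",
    left_on=["city", "state"],
    right_on=["city", "state_id"],
    indicator=True,
)

matched = checked[checked["_merge"] == "both"]
unmatched = checked[checked["_merge"] == "left_only"]

print(len(matched), "matched")
print(unmatched[["displayName", "city", "state"]])
```

Joining on both keys at once also avoids the large intermediate table of same-name cities. If the cities table contained duplicate city and state pairs, the matching universities would still be repeated, so that is worth checking separately.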

# Part 3. Displaying the Map

In [None]:
import geopandas as gpd
import math
import folium
from folium import Choropleth, Circle, Marker
from folium.plugins import HeatMap, MarkerCluster

In [None]:
# Create a map
m_1 = folium.Map(location=[42.32,-81.0589], tiles='openstreetmap', zoom_start=3)

# Add points to the map
for idx, row in df_merged.iterrows():
    Marker([row['lat'], row['lng']]).add_to(m_1)

# Display the map
m_1

In [None]:
# a new DataFrame with the top universities by ranking
df_top = df_merged.loc[df_merged['rankingDisplayScore'] > 90]

In [None]:
# show a map with the top universities by ranking
m_2 = folium.Map(location=[42.32,-81.0589], tiles='openstreetmap', zoom_start=3)

for idx, row in df_top.iterrows():
    Marker([row['lat'], row['lng']]).add_to(m_2)

m_2

We see that most of the top universities are located on the **East Coast**, two are in **California** and one is in **Chicago**.
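Both maps above use a hard-coded center. As a rough sketch, the center and the bounding box could instead be derived from the coordinates themselves. The `points` frame below is invented and stands in for the `lat` and `lng` columns of the merged table.

```python
import pandas as pd

# Hypothetical coordinates standing in for df_merged[['lat', 'lng']]
points = pd.DataFrame({
    "lat": [40.0, 42.0, 34.0],
    "lng": [-74.0, -71.0, -118.0],
})

# Edges of the bounding box around all points
south, north = points["lat"].min(), points["lat"].max()
west, east = points["lng"].min(), points["lng"].max()

# Center of the bounding box, usable as folium.Map(location=center)
center = [float((south + north) / 2), float((west + east) / 2)]

# South-west and north-east corners, the shape Map.fit_bounds expects
bounds = [[float(south), float(west)], [float(north), float(east)]]

print(center)
print(bounds)
```

Passing `bounds` to the map's `fit_bounds` method would let folium choose the zoom level, so a subset such as the top universities is framed without manual tuning.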
