## Real World Dataset Selection


# World University Ranking

This project analyzes the "World University Rankings" dataset with Python. The dataset comes from Kaggle, the world's largest data science community, which provides powerful tools and resources.

This dataset contains 2200 rows and 14 columns, which makes it informative to analyze. This project attempts to analyze various attributes of world universities, such as `quality_of_education`, `quality_of_faculty`, `publications`, `citations`, the university `score`, and more.

Libraries used:
* pandas
* numpy
* matplotlib
* seaborn
* plotly
* missingno
* AutoViz


> - Dataset link: https://www.kaggle.com/mylesoneill/world-university-rankings
> - The data is in CSV format and contains 2200 rows and 14 columns
> - Downloaded the dataset using the `opendatasets` library

#### Importing Necessary Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

#### Download Data from Link

In [None]:

# Path to the CSV file in the Kaggle input directory
data_dir = '../input/world-university-rankings/cwurData.csv'

## Data Preparation and Cleaning


> - Load the dataset into a data frame using Pandas
> - Explore the number of rows & columns, ranges of values etc.
> - Handle missing, incorrect and invalid data
> - Perform any additional steps (parsing dates, creating additional columns, merging multiple datasets, etc.)

### Data Preparation

In [None]:
university_raw_df = pd.read_csv(data_dir)

In [None]:
university_raw_df  # Display the full DataFrame

In [None]:
university_raw_df.head(10)  # Display the first 10 rows

In [None]:
university_raw_df.info()  # Column dtypes, non-null counts, and memory usage

In [None]:
university_raw_df.describe()  # Summary statistics for the numeric columns

In [None]:
university_raw_df.shape  # (number of rows, number of columns)

In [None]:
university_raw_df.index  # Row index of the DataFrame

In [None]:
university_raw_df.columns  # Column names

In [None]:
university_raw_df.isnull().sum()  # Number of NaN values in each column


In [None]:
university_raw_df.isnull().sum(axis=0)  # Same count, with the axis made explicit

In [None]:
university_raw_df.isnull().sum(axis=1)  # Number of NaN values in each row

In [None]:
missing_values_count = university_raw_df.isnull().sum()
missing_values_count

In [None]:
# how many total missing values does it have?
total_cells = np.prod(university_raw_df.shape)  # np.product was removed in NumPy 2.0
total_missing = missing_values_count.sum()

# percent of data that is missing
percent_missing = (total_missing/total_cells) * 100
print(percent_missing)
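
The overall percentage hides which columns are affected. As a minimal sketch, the same idea can be applied per column with `isnull().mean()`. The `toy_df` frame below is a made-up stand-in for the real data, and its values are for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy_df = pd.DataFrame({
    "score": [100.0, 98.66, np.nan, 96.81],
    "broad_impact": [np.nan, np.nan, 3.0, 4.0],
    "country": ["USA", "USA", "USA", "United Kingdom"],
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the percentage of missing cells in each column.
percent_missing_by_column = toy_df.isnull().mean() * 100

# Overall share of missing cells, equivalent to total_missing / total_cells.
overall_percent_missing = toy_df.isnull().to_numpy().mean() * 100

print(percent_missing_by_column)
print(overall_percent_missing)
```

Replacing `toy_df` with `university_raw_df` should give the per-column breakdown for the real data.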

In [None]:
import missingno as msno
msno.bar(university_raw_df)

### Data Cleaning
Here, the `broad_impact` column has 200 NaN values, so we fill them with 0.

In [None]:
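university_raw_df['broad_impact'].fillna(0)  # Preview only; the DataFrame is not modified

In [None]:
# Assign the result back so the change is stored in the DataFrame
university_raw_df['broad_impact'] = university_raw_df['broad_impact'].fillna(0)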
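
Filling with 0 is simple, but a filled 0 is treated like a real value in later statistics. The sketch below uses a made-up `toy_impact` series (not the real column) to show how zero-filling shifts the mean, with median filling shown as one possible alternative rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for a column with missing entries.
toy_impact = pd.Series([np.nan, np.nan, 10.0, 20.0, 30.0], name="broad_impact")

# pandas skips NaN by default, so only the three known values are averaged.
mean_ignoring_nan = toy_impact.mean()

# After zero-filling, the two zeros are averaged in as if they were real values.
mean_after_zero_fill = toy_impact.fillna(0).mean()

# One alternative: fill with the median of the known values.
filled_with_median = toy_impact.fillna(toy_impact.median())

print(mean_ignoring_nan, mean_after_zero_fill)
print(filled_with_median.tolist())
```

Which strategy fits depends on how the column is used afterwards; for plots that skip the column entirely, the choice makes no difference.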

In [None]:
university_raw_df

In [None]:
university_raw_df.isnull().sum()  # Check again for remaining NaN values

In [None]:
university_raw_df.isnull().values.any()  # True if any value is still missing

In [None]:
university_raw_df['broad_impact'].isnull().sum()  # Count the NaN values in a single column

In [None]:
university_raw_df.isnull().sum().sum()  # Count the NaN values in the entire DataFrame

In [None]:
university_raw_df.isnull()  # Boolean mask marking each missing cell

In [None]:
row_count = len(university_raw_df)  # len() gives the number of rows, not the number of NaN values

In [None]:
row_count

In [None]:
university_raw_df[university_raw_df.duplicated()]

There are no duplicate rows.
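
Note that `duplicated()` without arguments only flags rows that match in every column. A minimal sketch with a made-up `toy_df` (illustrative values, not the real data) shows how the `subset` argument finds repeats in selected columns, such as an institution that appears once per year.

```python
import pandas as pd

# Made-up rows: the same institution appears in two different years.
toy_df = pd.DataFrame({
    "institution": ["Harvard University", "Harvard University", "University of Oxford"],
    "year": [2014, 2015, 2015],
    "score": [100.0, 100.0, 96.46],
})

# Full-row check: no row repeats another in every column.
full_row_duplicates = toy_df[toy_df.duplicated()]

# Column-level check: keep=False marks every occurrence of a repeated institution.
repeated_institutions = toy_df[toy_df.duplicated(subset=["institution"], keep=False)]

print(len(full_row_duplicates), len(repeated_institutions))
```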

In [None]:
university_raw_df.dtypes

## Exploratory Analysis and Visualization



> - Compute the min, max, median, std and other interesting statistics for numeric columns
> - Make a note of interesting insights from the exploratory analysis
> - Plot a correlation graph using Seaborn and Matplotlib
> - Explore relationships between columns using bar charts
> - Explore relationships between columns using line markers with Matplotlib
> - Display a pie chart using Matplotlib

### Exploratory Analysis 

In [None]:
university_raw_df.min() # Returns the lowest value in each column

In [None]:
university_raw_df.max() # Returns the highest value in each column

In [None]:
university_raw_df.alumni_employment.max()

In [None]:
university_raw_df.median(numeric_only=True)  # Median of each numeric column

In [None]:
university_raw_df.std(numeric_only=True)  # Standard deviation of each numeric column

In [None]:
university_raw_df.loc[1]  # Row with index label 1

In [None]:
university_raw_df.rank()  # Rank the values within each column

In [None]:
university_raw_df.sort_values(by='country')  # Sort rows by country

In [None]:
university_raw_df.sort_index()  # Sort rows by index

In [None]:
university_raw_df.loc[4, 'country']  # Country of the row with index label 4

In [None]:
university_raw_df.loc[1, 'institution']  # Institution of the row with index label 1

In [None]:
university_raw_df['country'].value_counts()  # Number of rows for each country


In [None]:
university_raw_df['institution'].nunique()

In [None]:
university_raw_df.count()  # Number of non-NA values in each column

In [None]:
university_raw_df.iloc[7,3]

In [None]:
university_raw_df['country'].unique()

In [None]:
university_raw_df.corr(numeric_only=True)  # Pairwise correlation between the numeric columns

In [None]:
university_raw_df.cov(numeric_only=True)  # Pairwise covariance of the numeric columns, excluding NA/null values
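
Correlation and covariance are only defined for numeric columns, and recent pandas versions raise an error when text columns such as `institution` are included. One way to make the selection explicit is `select_dtypes`. This is a sketch on a made-up `toy_df`; the values are chosen only so the result is easy to check.

```python
import pandas as pd

# Made-up frame with one text column and three numeric columns.
toy_df = pd.DataFrame({
    "institution": ["A", "B", "C", "D"],
    "publications": [1, 2, 3, 4],
    "citations": [2, 4, 6, 8],
    "score": [90.0, 80.0, 70.0, 60.0],
})

# Keep only the numeric columns before computing the statistics.
numeric_df = toy_df.select_dtypes(include="number")

corr_matrix = numeric_df.corr()
cov_matrix = numeric_df.cov()

print(corr_matrix)
```

In this toy data `citations` is exactly twice `publications`, so their correlation is 1, while `score` falls as `publications` rises, giving -1.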

In [None]:
university_raw_df.iloc[0:2, :]  # Select the first two rows and all of the columns

In [None]:
university_raw_df.publications.count()

In [None]:
university_raw_df.groupby("year").mean(numeric_only=True)  # Mean of each numeric column per year

In [None]:
university_raw_df.groupby("publications").aggregate(['min', 'max'])  # Min and max of each column per publications value

#### Q1: Display university information which has got 100 score.

In [None]:
university_raw_df[university_raw_df["score"] == 100]

### Data Visualization

Let's begin by importing `matplotlib.pyplot` and `seaborn`.

In [None]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import plotly.express as px

%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

In [None]:
plt.figure(figsize=(15,8))
sns.heatmap(university_raw_df.corr(numeric_only=True), annot=True)  # Correlate numeric columns only
plt.title("Correlation Graph", fontweight="bold")


In [None]:
fig = px.scatter_3d(university_raw_df, x='publications', y='influence', z='citations')
fig.update_traces(marker_size=3, marker_opacity=0.5)
fig.show()

In [None]:
!pip install AutoViz
!pip3 install xlrd

import plotly
import plotly.express  as px
from autoviz.AutoViz_Class import AutoViz_Class

AV= AutoViz_Class()
df_autoviz= AV.AutoViz('../input/world-university-rankings/cwurData.csv')

In [None]:
px.scatter(data_frame=university_raw_df,
                x= 'world_rank',
                y='publications',
                size='year',
                color='publications',
                title= 'Publications vs. World Rank',
                labels= {'world_rank': 'World Rank',
                         'year': 'Year'},
                log_x= True,
                range_y= [0,400],
                hover_name= 'institution',
                animation_frame='world_rank',
                height= 400,
                size_max=40)

#### Q2: How can we show the University of Oxford's progression between 2012 and 2015?

### Line Chart
A line chart can show the progression of the University of Oxford between 2012 and 2015. Here, the university's score for each year is plotted.

In [None]:
years = [2012, 2013, 2014, 2015]

In [None]:
score = [82.34, 92.54, 97.51,96.46]

In [None]:
plt.plot(years, score)
plt.xlabel('Year')
plt.ylabel('Progress (Score)')
plt.title("University of Oxford");
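
The `years` and `score` lists above are typed in by hand. As a sketch of how they could be derived instead, the block below filters a made-up `toy_df` by institution and sorts by year. The Oxford scores are copied from the lists above; "Hypothetical University" and its score are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the dataset; rows are deliberately out of year order.
toy_df = pd.DataFrame({
    "institution": ["University of Oxford"] * 4 + ["Hypothetical University"],
    "year": [2013, 2012, 2015, 2014, 2015],
    "score": [92.54, 82.34, 96.46, 97.51, 50.0],
})

# Keep only the Oxford rows and order them chronologically.
oxford = toy_df[toy_df["institution"] == "University of Oxford"].sort_values("year")

years = oxford["year"].tolist()
score = oxford["score"].tolist()

print(years)
print(score)
```

Applied to `university_raw_df`, the same filter would keep the chart in sync with the data if the file changes.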

#### Q3: How can we show the world's top 10 universities with their education quality?

### Bar Chart
This bar chart shows the relationship between education quality and world rank for the top 10 universities in the ranking.

In [None]:
top10_df = university_raw_df.head(10)  # First 10 rows, so x and y have the same length
sns.barplot(data=top10_df, x='world_rank', y='quality_of_education')
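
Taking the first 10 rows depends on the row order of the file. A minimal sketch of a more explicit selection, assuming the goal is the 10 best-ranked universities of a single year, is shown below. The helper name `top_n_for_year` is hypothetical and `toy_df` contains made-up values.

```python
import pandas as pd

# Made-up data: 12 ranked universities in each of two years.
toy_df = pd.DataFrame({
    "world_rank": list(range(1, 13)) * 2,
    "year": [2014] * 12 + [2015] * 12,
    "quality_of_education": list(range(12, 0, -1)) + list(range(1, 13)),
})


def top_n_for_year(df, year, n=10):
    """Return the n rows with the smallest world_rank for the given year."""
    return df[df["year"] == year].nsmallest(n, "world_rank")


top10_2015 = top_n_for_year(toy_df, 2015)
print(top10_2015)
```

The resulting frame can be passed to `sns.barplot` through its `data` argument in the same way as any other DataFrame.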

#### Q4: How can we compare the publications and influence columns?

### Line Markers
Markers can be added to the data points on each line using the `marker` argument of `plt.plot`. Here, the markers show the publications and influence values of the 10 top-ranked universities. The plot shows that the publications value is at or above the influence value at every rank. With Matplotlib's default color cycle, the blue line shows influence and the orange line shows publications.

In [None]:
world_rank = range(1,11)
publications = [1,12,4,16,37,53,15,14,13,6]
influence = [1,4,2,16,22,33,13,6,12,5]

In [None]:
plt.plot(world_rank, influence, marker='o')
plt.plot(world_rank, publications, marker='x')

plt.xlabel('world_rank')
plt.ylabel('value')  # Both lines share this axis, so the label is not specific to influence

plt.title("Publications and influence by world_rank")
plt.legend(['influence', 'publications']);

#### Q5: Which are the world's top 5 universities in 2015?

### Pie Chart

The pie chart shows the top 5 universities in the 2015 world ranking according to their scores.


In [None]:
import matplotlib.pyplot as plt

# Pie chart, where the slices will be ordered and plotted counter-clockwise:
labels = 'Harvard University','Stanford University','Massachusetts Institute of Technology','University of Cambridge','University of Oxford'
sizes = [100,98.66,97.54,96.81,96.46] # university score
explode = (0.1, 0, 0, 0,0)  

fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title('Top 5 World Universities in 2015')
plt.show()
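
The labels and sizes above are entered by hand. The sketch below shows one way they could be derived with `nlargest`, using a made-up `toy_df` whose five real names and scores are copied from the cell above; "Hypothetical University" is invented. It also computes the percentage each slice receives, since the scores are close together and each slice ends up at roughly 20%.

```python
import pandas as pd

# Toy stand-in for the 2015 rows, deliberately unsorted.
toy_df = pd.DataFrame({
    "institution": [
        "University of Oxford", "Harvard University", "University of Cambridge",
        "Stanford University", "Massachusetts Institute of Technology",
        "Hypothetical University",
    ],
    "year": [2015] * 6,
    "score": [96.46, 100.0, 96.81, 98.66, 97.54, 50.0],
})

# Five highest scores of 2015, ordered from highest to lowest.
top5 = toy_df[toy_df["year"] == 2015].nlargest(5, "score")

labels = top5["institution"].tolist()
sizes = top5["score"].tolist()

# Share of the pie that each university receives, in percent.
shares = (top5["score"] / top5["score"].sum() * 100).round(1).tolist()

print(labels)
print(shares)
```

Because the slices are nearly equal, a bar chart of the same `sizes` may make the differences between scores easier to read.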
