# **EXPLORATORY DATA ANALYSIS**


FEATURES OF DATASET
1.	patient_id : Identification number assigned to every patient who tested positive for COVID-19
2.	sex : Gender of the patient
3.	age : Age range of the patient by decade (e.g., a 25-year-old is recorded as 20s)
4.	country : Country to which the patient belongs
5.	province : Province to which the patient belongs
6.	city : City to which the patient belongs
7.	infection_case : Cause of the infection
8.	infected_by : ID of the patient who infected this person
9.	contact_number : Number of people this patient came in contact with
10.	symptom_onset_date : Date on which symptoms first appeared
11.	confirmed_date : Date on which the person tested positive
12.	released_date : Date on which the patient was released from the isolation center
13.	deceased_date : Date on which the patient died
14.	state : Current state of the person (isolated, released, deceased)


In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


In [None]:
data = pd.read_csv('../input/coronavirusdataset/PatientInfo.csv')
# To display the top 5 rows
data.head(5)

In [None]:
data.shape

The data consists of 5165 rows and 14 columns.

In [None]:
data.info()

A summary of each column's non-null count and data type.

In [None]:
data.dtypes

The data type of each column.

## ***Cleaning data*** 

In [None]:
data.isnull().sum()

The number of null values in each column is shown above.
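Raw counts can be hard to compare across columns, so a percentage view may help. The following is a minimal sketch on a made-up three-column frame (`sample` is hypothetical and only stands in for `PatientInfo.csv`); the same expression should work on `data`.

```python
import pandas as pd

# Hypothetical miniature frame standing in for PatientInfo.csv
sample = pd.DataFrame({
    "sex": ["male", "female", None, "female"],
    "age": ["20s", None, None, "50s"],
    "state": ["released", "isolated", "released", "deceased"],
})

# isnull().mean() is the fraction of missing entries per column
missing_pct = (sample.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Columns with a very high missing percentage are candidates for dropping rather than filling.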

Extracting the numeric value from the age range:

In [None]:
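# Strip the trailing "s" (e.g. "20s" -> "20"); slicing with [:2] would turn "100s" into "10"
data['age_range_numeric'] = data['age'].str.rstrip('s')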
data.head()

Converting the extracted values to float:

In [None]:
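def stringremover(doc):
    """Return the first whitespace-separated token of doc that parses as a float, else None."""
    for token in doc.split():
        try:
            return float(token)
        except ValueError:
            # Token is not numeric (e.g. "-"); try the next one
            continue
    return None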

In [None]:
# Treat missing contact counts as 0.0, then parse every entry to float
data['contact_number'] = (
    data['contact_number'].fillna("0.0").apply(stringremover)
)

In [None]:
# Treat missing ages as 0.0, then parse every entry to float
data['age_range_numeric'] = (
    data['age_range_numeric'].fillna("0.0").apply(stringremover)
)
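As an alternative to the conversion cells above, `pd.to_numeric` with `errors="coerce"` can do the parsing in one step. This is only a sketch on hypothetical values (`raw` is made up), and it differs from the notebook in one respect: missing or unparseable entries become `NaN` instead of `0.0`.

```python
import pandas as pd

# Hypothetical values mimicking the raw columns
raw = pd.DataFrame({
    "age": ["20s", "50s", None, "100s"],
    "contact_number": ["3", "-", None, "12"],
})

# Strip the trailing "s"; to_numeric turns anything unparseable into NaN
age_numeric = pd.to_numeric(raw["age"].str.rstrip("s"), errors="coerce")
contact_numeric = pd.to_numeric(raw["contact_number"], errors="coerce")

print(age_numeric.tolist())
print(contact_numeric.tolist())
```

Keeping `NaN` for unknown values avoids confusing "not recorded" with a genuine zero.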


In [None]:
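# Drop infants ("0s") and rows whose age or contact count is 0.0
# (0.0 includes the values that were missing and filled in above)
data.drop(data[data['age'] == "0s"].index, inplace = True)
data.drop(data[data['age_range_numeric'] == 0.0].index, inplace = True)
data.drop(data[data['contact_number'] == 0.0].index, inplace = True)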

## ***UNIVARIATE ANALYSIS***

ANALYSIS ON INFECTION CASE

In [None]:
data['infection_case'].unique()

These are the distinct sources of infection recorded in the dataset.

In [None]:
data.infection_case.value_counts().plot.bar()
plt.title("INFECTION CASE")
plt.xlabel("infection case")
plt.ylabel("count")

The graph shows the sources of infection and how often each one occurs.
The most common source is contact with other patients, followed by overseas inflow.

ANALYSIS ON GENDER

In [None]:
data.sex.value_counts().plot.pie(autopct="%0.2f%%")
plt.title("Gender ratio")

The plot shows the sex ratio of patients. Men and women are affected in nearly equal numbers.
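The percentages drawn on the pie chart can also be read off numerically. A minimal sketch, assuming a made-up `sex` series (the notebook would use `data["sex"]`):

```python
import pandas as pd

# Hypothetical sample; the real notebook would use data["sex"]
sex = pd.Series(["female", "male", "female", "male", "female", None])

# normalize=True returns proportions; missing values are excluded by default
shares = sex.value_counts(normalize=True)
gap = abs(shares["female"] - shares["male"])

print(shares)
print(gap)
```

A small `gap` supports the "nearly equal" reading; a large one would contradict it.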

ANALYSIS ON CITY

In [None]:
data['city'].unique()

In [None]:
data['city'].nunique()

Patients come from 65 distinct cities in total.

In [None]:
plt.figure(figsize=(25,50))
data.city.value_counts().plot(kind='barh')
plt.title("City count")
plt.xlabel("count")
plt.ylabel("city")

The bar graph above shows the number of patients in each city.
Songpa-gu is the least affected and Cheonan-si is the most affected (approximately 70 patients).
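Reading extremes off a tall bar chart is error-prone, so the most and least affected cities can be pulled directly from the counts. This sketch uses invented counts for a few city names (the numbers are not from the dataset); ties for the minimum are all returned.

```python
import pandas as pd

# Invented counts for illustration only
city = pd.Series(
    ["Cheonan-si"] * 3 + ["Gumi-si"] * 2 + ["Songpa-gu"]
)

counts = city.value_counts()
most_city, most_n = counts.idxmax(), counts.max()
# Several cities may share the minimum, so keep all of them
least_cities = counts[counts == counts.min()].index.tolist()

print(most_city, most_n)
print(least_cities)
```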

ANALYSIS ON COUNTRY

In [None]:
data.country.value_counts().plot.bar()
plt.title("Most affected Country")
plt.xlabel("country")
plt.ylabel("count")

The graph shows the number of patients by country. Korea accounts for nearly all patients; the counts for other countries are negligible, most likely because the data was collected in Korea.

ANALYSIS ON STATE

In [None]:
plt.title("State ratio")
data.state.value_counts().plot.pie()
centre_circle = plt.Circle((0,0),0.5,color='black', fc='white',linewidth=1.25)
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

The donut plot shows the proportion of patients who are released, isolated, and deceased. Most patients have been released from the isolation center, followed by those still isolated; very few patients have died.

ANALYSIS ON AGE

In [None]:
data.age_range_numeric.plot.hist()
plt.title("Age Histogram")
plt.xlabel("Age range")

People in their 20s are the most affected. The apparent gap between 40 and 50 is probably a binning artifact rather than a real absence of patients in their 40s, because ages are stored only as decades.
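To check how binning affects a decade-valued column, the sketch below compares the default 10 equal-width bins with bin edges aligned to decades. The `ages` values are hypothetical; `np.histogram` is used because it applies the same default binning as the histogram plot.

```python
import numpy as np
import pandas as pd

# Hypothetical decade-valued ages; the notebook would use data["age_range_numeric"]
ages = pd.Series([10, 20, 20, 20, 30, 40, 40, 50, 60, 70, 80, 90], dtype=float)

# Default behaviour: 10 equal-width bins between min and max (edges 10, 18, 26, ...)
default_counts, default_edges = np.histogram(ages, bins=10)

# Decade-aligned edges (5, 15, ..., 95) so that each decade falls in its own bin
decade_edges = np.arange(5, 100, 10)
decade_counts, _ = np.histogram(ages, bins=decade_edges)

print(default_counts)
print(decade_counts)
```

With the default edges, the bin from 42 to 50 contains no decade value and so appears empty. Passing the aligned edges to the plot, for example `data.age_range_numeric.plot.hist(bins=decade_edges)`, should remove that gap.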

In [None]:
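# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(data['age_range_numeric'], kde=True, stat='density')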
plt.title("Age distribution plot")

The graph shows the estimated probability density of the ages. The line is the kernel density estimate, which peaks in the age range where patients are most concentrated.

ANALYSIS ON PROVINCE

In [None]:
plt.figure(figsize=(20,5))
data.province.value_counts().plot.bar()
plt.xlabel("Province")
plt.ylabel("count")
plt.title("Province with the respective patient count")

Gyeonggi-do is the most affected province, while Jeju-do is the least affected. The bars decrease from left to right because value_counts() sorts the provinces by patient count.

ANALYSIS ON CONTACT NUMBER

In [None]:
plt.figure(figsize=(25,10))
data.contact_number.value_counts().plot.bar()
plt.title("Number of people each patient made contact with")

This suggests that most patients made contact with at least 2 people.

CONVERTING CATEGORICAL TO NUMERICAL VARIABLES

In [None]:
data["infection_case"] = data["infection_case"].astype('category')
data["infection_case"] = data["infection_case"].cat.codes
data.head()
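Once the column is overwritten with integer codes, the original labels are no longer visible. A sketch of how the code-to-label mapping could be saved first, using made-up `cases` values (the notebook would use `data["infection_case"]` before the overwrite):

```python
import pandas as pd

# Hypothetical labels; the notebook would use data["infection_case"]
cases = pd.Series(["overseas inflow", "contact with patient", "etc", "overseas inflow"])

as_cat = cases.astype("category")
codes = as_cat.cat.codes

# Categories are ordered, so their position is the integer code
code_to_label = dict(enumerate(as_cat.cat.categories))

print(codes.tolist())
print(code_to_label)
```

Note that these codes are arbitrary identifiers: their numeric order carries no meaning, which matters when interpreting correlations that involve this column.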

In [None]:
data.describe()

## ***BIVARIATE ANALYSIS***

In [None]:
data.groupby('sex')['contact_number'].mean().plot.bar()
plt.ylabel("Contact numbers mean")
plt.title("Male vs Female wrt Contact number")

The graph shows the mean number of contacts per patient for each gender. Male patients have the higher mean, meaning that on average more people came into contact with male patients than with female patients.
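A mean can be pulled upward by a few patients with very many contacts, so it may be worth looking at the median and group size as well. A minimal sketch on invented numbers (`sample` is hypothetical):

```python
import pandas as pd

# Invented values: one male patient with an unusually large contact count
sample = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female", "female"],
    "contact_number": [1.0, 2.0, 60.0, 4.0, 5.0, 6.0],
})

# Several statistics per group in one call
summary = sample.groupby("sex")["contact_number"].agg(["mean", "median", "count"])
print(summary)
```

In this made-up case the male mean is higher but the male median is lower, which shows how a single outlier can change the conclusion.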

In [None]:
plt.figure(figsize=(20,10))
plt.title('age wise state frequency',fontsize = 15)
sns.countplot(y='age', hue='state',data=data)

This graph shows that most patients in their 20s are released or isolated, while most of the deceased patients were in their 70s and 80s.
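Because the age groups differ in size, raw counts can hide how the outcome shares differ between them. A sketch using `pd.crosstab` with row normalization on invented rows (`sample` is hypothetical):

```python
import pandas as pd

# Invented rows for illustration only
sample = pd.DataFrame({
    "age": ["20s", "20s", "20s", "20s", "80s", "80s"],
    "state": ["released", "released", "isolated", "released", "deceased", "released"],
})

# normalize="index" makes each age group's row sum to 1
share = pd.crosstab(sample["age"], sample["state"], normalize="index")
print(share)
```

Each row then reads as the fraction of that age group in each state.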

In [None]:
sns.countplot(data=data, x='age', hue='sex')
plt.title("age wise sex ratio")

This shows how many male and female patients fall in each age range. The largest number of patients who tested positive for COVID-19, both male and female, are in their 20s. In almost every age group, more females than males are affected.

In [None]:
data['released_date'] = pd.to_datetime(data['released_date'], format='%Y-%m-%d')

In [None]:
data['confirmed_date'] = pd.to_datetime(data['confirmed_date'], format='%Y-%m-%d')

In [None]:
data['deceased_date'] = pd.to_datetime(data['deceased_date'], format='%Y-%m-%d')

In [None]:
plt.figure(figsize=(15,5))
plt.scatter(data['confirmed_date'],data['released_date'],marker='x')
plt.title("Scatter plot of confirmed and released date of patient")

There were a lot of people released between 15 Feb and 15 March.
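The two date columns can also be combined into a single duration, which may be easier to interpret than a scatter of dates. A minimal sketch on invented dates; `days_to_release` is a hypothetical column name, not part of the dataset.

```python
import pandas as pd

# Invented dates; the last patient has not been released yet
sample = pd.DataFrame({
    "confirmed_date": ["2020-02-20", "2020-03-01", "2020-03-05"],
    "released_date": ["2020-03-05", "2020-03-20", None],
})

for col in ["confirmed_date", "released_date"]:
    sample[col] = pd.to_datetime(sample[col], format="%Y-%m-%d")

# Subtracting datetimes gives a timedelta; .dt.days extracts whole days
sample["days_to_release"] = (sample["released_date"] - sample["confirmed_date"]).dt.days
print(sample)
```

Patients without a release date get a missing duration rather than a misleading value.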

In [None]:
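# numeric_only=True skips text and date columns, which recent pandas versions refuse to correlate
data.corr(numeric_only=True)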

The matrix above shows the pairwise correlations between the numerical variables.

A negative correlation is a relationship between two variables in which one decreases as the other increases.
A value of -1 indicates a perfect negative correlation, 0 indicates no linear correlation, and +1 indicates a perfect positive correlation.
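As a worked example of these values, the sketch below computes the Pearson coefficient by hand on small invented series and compares it with pandas. The helper `pearson` is hypothetical and written only for illustration.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y_neg = pd.Series([10.0, 8.0, 6.0, 4.0, 2.0])  # falls as x rises
y_pos = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])  # rises with x

def pearson(a, b):
    """Covariance of a and b divided by the product of their spreads."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float((a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum()))

r_neg = pearson(x, y_neg)
r_pos = pearson(x, y_pos)

print(r_neg, x.corr(y_neg))
print(r_pos, x.corr(y_pos))
```

Both series are exact linear functions of `x`, so the coefficients sit at the two extremes of the range.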

### Heatmap to show the relationships of the variables using Correlation Matrix 

In [None]:
plt.figure(figsize=(10,8))
plt.title('Heat Map', fontsize=20)
sns.set(font_scale=1.0)
sns.heatmap(data.corr(numeric_only=True), cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size':15}, linewidths=3)