# Python for Data Analysis II

## Individual assignment

## Part 1 (regular expressions)

The goal is to extract dates in different formats from medical data.
We need to correctly identify all the date variants encoded in this dataset, then standardize and sort the dates.

###### Note: use an "or" (`|`) alternative for the month-and-year-only case, with strictly 2- or 4-digit years

### Data loading

In [None]:
import re
with open("./medical_dataset.txt") as f:
    lines = f.readlines()

### String vectorization

In [None]:
import pandas as pd
pd.options.display.max_rows = None
df = pd.DataFrame(lines, columns=["text"])
df_original = df.copy()  # keep an untouched copy; df itself is filtered step by step below

### Steps

Each line of the file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.

1- Have a look at the lines and take note of the different date formats in the file

2- Design and check a regular expression for each of these formats. Use vectorized string methods to avoid loops

3- Try to rewrite these expressions more compactly (for example, by merging two or three regular expressions in one)

4- Create a dataframe with four columns: the original text, the month, the day and the year. All three fields must be numeric and the year must be represented by 4 digits. All texts must have this data extracted.

5- Save the final DataFrame to an Excel file named "processed_dates.xlsx"
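
Steps 2 and 3 can be illustrated with a minimal sketch. It assumes a few made-up sample notes (the real ones come from the dataset file) and shows how a single pattern with named groups, applied through pandas' `str.extract`, covers two separators and two year lengths without an explicit loop.

```python
import pandas as pd

# Hypothetical sample notes; the real ones are read from medical_dataset.txt.
notes = pd.Series([
    "Seen on 03/25/93 for follow-up",
    "Admitted 6-18-1985, discharged next day",
    "No date in this note",
])

# One pattern covers both "/" and "-" separators and 2- to 4-digit years.
pattern = r"(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{2,4})"

# str.extract applies the pattern to every row at once and returns one
# column per named group; rows without a match are filled with NaN.
parts = notes.str.extract(pattern)
print(parts)
```

The named groups become the column names, so the output already has the `month`, `day` and `year` layout that step 4 asks for.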


### Tips

* Assume all dates whose year is encoded in only two digits are from the 1900s (e.g. 1/5/89 is January 5th, 1989).
* If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
* If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
* There could be potential typos as this is a raw, real-life derived dataset.
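
The defaults in the tips above can be collected in one small helper. This is only a sketch: the function name `normalize_date_parts` is hypothetical, and it assumes each part arrives as a string of digits, or as `None` when that part is missing from the note.

```python
def normalize_date_parts(month, day, year):
    """Apply the defaults from the tips to raw date parts.

    Each argument is a string of digits, or None when that part is missing.
    Returns a (month, day, year) tuple of integers.
    """
    year = int(year)
    # Two-digit years are assumed to belong to the 1900s.
    if year < 100:
        year += 1900
    # A missing month or day defaults to the first one.
    month = int(month) if month is not None else 1
    day = int(day) if day is not None else 1
    return month, day, year


print(normalize_date_parts("1", "5", "89"))      # from 1/5/89
print(normalize_date_parts("9", None, "2009"))   # from 9/2009
print(normalize_date_parts(None, None, "2010"))  # from 2010
```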

### Part 1: detecting and extracting dates in mm/dd/yy or mm/dd/yyyy format, with numeric or alphabetic months

In [None]:
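# Numeric dates: mm/dd/yy or mm/dd/yyyy, separated by "/" or "-".
df_01 = df["text"].str.extract(r'(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{2,4})').dropna(how = 'any')
# Remove the rows already matched so later patterns only see the remaining notes.
i_exclude = df_01.index
df = df[~df.index.isin(i_exclude)]

# Month name first, e.g. "Mar 25, 1993" or "March-25-1993".
# Inside a character class "|" is a literal and "|-|" is read as a range, so separators are listed directly.
df_02 = df["text"].str.extract(r'(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-zA-Z.,-]*[\s,.-]*(?P<day>\d{1,2})[\s,.-]*(?P<year>\d{4})').dropna(how = 'any')
i_exclude = df_02.index
df = df[~df.index.isin(i_exclude)]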

### Part 2: detecting and extracting dates in day/month/yy or day/month/yyyy format, with numeric or alphabetic months

In [None]:
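# Day first, then month name, e.g. "25 Mar 1993" or "25-Mar-93".
# The separators are listed directly: inside a character class "|-|" is read as a range and never matches "-".
df_03 = df["text"].str.extract(r'(?P<day>\d{1,2})[\s,.-](?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-zA-Z.,-]*[\s,.-]*(?P<year>\d{2,4})?').dropna(how = 'any')
i_exclude = df_03.index
df = df[~df.index.isin(i_exclude)]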


### Part 3: extracting dates with only month and year

In [None]:
# Numeric month and 4-digit year, e.g. "9/2009"; the day defaults to 1.
df_04 = df["text"].str.extract(r'(?P<month>1[0-2]|[1-9])[\s/,-]*(?P<year>\d{4})').dropna(how = 'any')
df_04['day'] = 1
i_exclude = df_04.index
df = df[~df.index.isin(i_exclude)]


# Month name and 4-digit year, e.g. "Sep 2009"; the day defaults to 1.
df_05 = df["text"].str.extract(r'(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-zA-Z.,-]*[\s,.-]*(?P<year>\d{4})').dropna(how = 'any')
df_05['day'] = 1
i_exclude = df_05.index
df = df[~df.index.isin(i_exclude)]
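
The tips also mention notes that contain only a year (e.g. 2010), which none of the patterns so far target. Below is a minimal sketch of one way to handle them. It assumes that a standalone four-digit number starting with 19 or 20 is the year; the sample notes and the names `leftover` and `year_only` are hypothetical.

```python
import pandas as pd

# Hypothetical notes left over after the other patterns have been applied.
leftover = pd.Series(
    ["Symptoms started in 2010", "History since 1984."],
    index=[7, 9],
)

# Assumption: a standalone 4-digit number starting with 19 or 20 is the year.
year_only = leftover.str.extract(r"\b(?P<year>(?:19|20)\d{2})\b").dropna(how="any")

# With no month or day available, default both to 1 (January 1st).
# The month is kept as a string so it matches the other extracted frames.
year_only["month"] = "1"
year_only["day"] = 1
print(year_only)
```

A frame built this way keeps the original row index, so it could be appended to the list passed to `pd.concat` alongside the other partial results.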


### Concat

In [None]:
result = pd.concat([df_01,df_02,df_03,df_04, df_05])

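# Two-digit years are assumed to be in the 1900s; longer years are kept as written.
result.year = [int('19' + str(i)) if len(i) == 2 else int(i) for i in result.year]
# Strip stray whitespace around the month.
result.month = [str(i).strip() for i in result.month]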

result

### Final Result

In [None]:

cols = ["day", "month", "year"]
df_original.loc[result.index, cols] = result[cols]
df_original



In [None]:
# Convert month names to numbers; numeric months pass through unchanged.
month_numbers = {'Jan': '1', 'Feb': '2', 'Mar': '3', 'Apr': '4', 'May': '5', 'Jun': '6',
                 'Jul': '7', 'Aug': '8', 'Sep': '9', 'Oct': '10', 'Nov': '11', 'Dec': '12'}

def month_to_number(month):
    # Return the number of the first month abbreviation found in the string.
    for name, number in month_numbers.items():
        if name in month:
            return number
    return month

# Zero-pad month and day to two characters (e.g. '1' -> '01').
df_original.month = [str(month_to_number(m)).zfill(2) for m in df_original.month]
df_original.day = [str(d).zfill(2) for d in df_original.day]



#### Part 1: Final Result

In [None]:
df_original
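
The steps call for numeric fields, sorted dates and an Excel export. The following is a minimal sketch of that last stage on a hypothetical sample frame, assuming every row has all three date parts. The helper name `export_dates` is made up for illustration, and `to_excel` needs an Excel writer such as openpyxl to be installed.

```python
import pandas as pd

# Hypothetical sample in the same shape as df_original after Part 1.
sample = pd.DataFrame({
    "text": ["note a", "note b", "note c"],
    "day": ["05", "01", "18"],
    "month": ["01", "09", "06"],
    "year": [1989, 2009, 1985],
})

# Convert the zero-padded strings to integers so all three fields are numeric.
cols = ["month", "day", "year"]
sample[cols] = sample[cols].astype(int)

# Sort chronologically by year, then month, then day.
ordered = sample.sort_values(["year", "month", "day"]).reset_index(drop=True)
print(ordered)


def export_dates(frame, path="processed_dates.xlsx"):
    """Write the four required columns to an Excel file."""
    frame[["text", "month", "day", "year"]].to_excel(path, index=False)
```

Calling `export_dates(ordered)` would then write the file requested in step 5.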

## Part 2 (plotly)

In [None]:
import plotly.offline as py
import plotly.graph_objs as go
import plotly.express as px
py.init_notebook_mode(connected=True) # this allows to display plotly graphs in Jupyter

In [None]:
df_ = pd.read_csv("./dataset_housing (2).csv")
pd.options.display.max_columns=1000
df_.head()

Explore using plotly / plotly express the following questions:

* Is there any relation between neighborhood and price?

* Is there any relation between neighborhood and year built?

* How do overall quality, lot area, year built and price interact with each other?

* How do quality, lot area, year built and price interact with each other and evolve over time?

### Is there any relation between neighborhood and price?

In [None]:
df_1 = df_[['Neighborhood', 'SalePrice']]
fig = px.scatter(df_1, x="Neighborhood", y="SalePrice", title = 'Sale Price by Neighborhood')
# Order the neighborhoods on the x-axis by their median sale price.
fig.update_xaxes(categoryorder = 'median ascending')

py.iplot(fig)
print('With the x-axis ordered by median SalePrice, neighborhoods toward the right')
print('tend to have higher-priced properties, while neighborhoods toward the left')
print('tend to have lower-priced properties.')
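
The ordering used on the x-axis can also be checked numerically. As a minimal sketch, assuming a small made-up sample with the same two columns (the prices below are hypothetical, not taken from the dataset), a `groupby` gives the median price per neighborhood:

```python
import pandas as pd

# Hypothetical sample with the two columns used in the plot; values are made up.
homes = pd.DataFrame({
    "Neighborhood": ["OldTown", "OldTown", "Sawyer", "Sawyer", "BrDale"],
    "SalePrice": [100000, 120000, 130000, 150000, 90000],
})

# Median sale price per neighborhood, sorted in ascending order like the x-axis.
median_price = homes.groupby("Neighborhood")["SalePrice"].median().sort_values()
print(median_price)
```

Running the same two lines on `df_` would give a table to read alongside the plot.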

### Is there any relation between neighborhood and year built?


In [None]:
df_2 = df_[['Neighborhood', 'YearBuilt']]
fig = px.box(df_2, x="Neighborhood", y="YearBuilt", title = 'Year built by Neighborhood')
# Order the neighborhoods on the x-axis by their median construction year.
fig.update_xaxes(categoryorder = 'median ascending')
py.iplot(fig)
print(' ')
print('The neighborhoods between OldTown and BrDale contain houses from all construction years, but the median is around 1960')
print('The neighborhoods between BrDale and Sawyer contain houses built in 1960 or later, with a median around 1980')
print('The neighborhoods between Sawyer and NridgeHt have a median in the 2000s and include the most recent houses in the dataset')

### How do overall quality, lot area, year built and price interact with each other?

In [None]:
df_3 = df_[['OverallQual', 'LotArea', 'YearBuilt', 'SalePrice']]
fig = px.scatter_matrix(df_3)
fig.show()
print(' ')
print('OverallQual and SalePrice have a clear positive relationship: the higher the quality, the higher the price')
print('SalePrice and YearBuilt show a clear trend: more recently built houses have a higher sale price')
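
The relationships read off the scatter matrix can be backed by a correlation table. This is only a sketch on made-up values with the same four column names; on the real data the same `corr` call would be applied to `df_3`.

```python
import pandas as pd

# Hypothetical sample with the four columns of the scatter matrix; values are made up.
sample = pd.DataFrame({
    "OverallQual": [4, 5, 6, 7, 8],
    "LotArea": [8000, 9500, 7000, 11000, 10000],
    "YearBuilt": [1950, 1965, 1980, 1999, 2005],
    "SalePrice": [90000, 120000, 150000, 210000, 260000],
})

# Pairwise Pearson correlations between the numeric columns.
corr = sample.corr()

# Correlation of each variable with SalePrice, strongest first.
print(corr["SalePrice"].sort_values(ascending=False))
```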

### How do quality, lot area, year built and price interact with each other and evolve over time?

In [None]:
fig = px.scatter(
    df_, x="SalePrice", y="OverallQual", animation_frame="YrSold",
    animation_group="LotArea", size="SalePrice", color="YearBuilt",
    hover_name="LotArea", log_x=False, size_max=55,
    range_x=[20000,1000000], range_y=[1, 15]
)

py.iplot(fig)

print(' ')
print('Over the years, OverallQual and SalePrice have maintained a clear positive relationship: the higher the quality, the higher the price')
