# Landslides After Rainfall, 2007-2016
## Location and cause of landslide events around the world

### Description:
#### Context

Landslides are one of the most pervasive hazards in the world, causing more than 11,500 fatalities in 70 countries since 2007. Saturating the soil on vulnerable slopes, intense and prolonged rainfall is the most frequent landslide trigger.

#### Content

The Global Landslide Catalog (GLC) was developed with the goal of identifying rainfall-triggered landslide events around the world, regardless of size, impacts, or location. The GLC considers all types of mass movements triggered by rainfall, which have been reported in the media, disaster databases, scientific reports, or other sources.

#### Acknowledgements

The GLC has been compiled since 2007 at NASA Goddard Space Flight Center.

Data downloaded from: https://www.kaggle.com/nasa/landslide-events

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:#7ca4cd; border:0' role="tab" aria-controls="home"><center>Data loading and cleaning</center></h3>

## Landslide data set

In [None]:
df = pd.read_csv('../input/landslide-events/catalog.csv')
print('Shape of the file')
print('-'*30)
print(df.shape)
print('')
print('Number of missing values per column')
print('-'*30)
print(df.isnull().sum())

### Dropping Columns

In [None]:
df = df.drop(columns=['time','continent_code','location_description','storm_name','fatalities','injuries','source_name','source_link',])
print('dataframe shape')
print('-'*30)
print(df.shape)

### Dropping rows

In [None]:
df = df.dropna()
print('dataframe shape')
print('-'*30)
print(df.shape)
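
Before dropping rows, it can help to see how much each column contributes to the loss. The sketch below uses a small made-up frame (not the real catalog) to show the share of missing values per column and the number of rows that survive a plain `dropna()`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the catalog; the values are invented for illustration
toy = pd.DataFrame({
    "landslide_type": ["Landslide", None, "Mudslide", "Rockfall"],
    "trigger": ["Rain", "Downpour", None, "Rain"],
    "distance": [1.2, 3.4, 0.5, np.nan],
})

# Fraction of missing values in each column
missing_share = toy.isnull().mean()

# Number of rows left once every row with a missing value is dropped
rows_kept = len(toy.dropna())

print(missing_share)
print(rows_kept)
```

If one column accounts for most of the lost rows, dropping that column instead of the rows may preserve more events.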

### Data types

In [None]:
print('Variable types')
print('-'*30)
print(df.dtypes)

In [None]:
# Converting date format
df['date'] = pd.to_datetime(df['date'],errors='coerce')
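
With `errors='coerce'`, values that cannot be parsed become `NaT` instead of raising an error, so it is worth counting them afterwards. A minimal sketch on made-up strings; the `%m/%d/%y` format is an assumption for the example, not something confirmed for the catalog.

```python
import pandas as pd

# Invented date strings, one of which is not a valid date
raw = pd.Series(["3/2/07", "not a date", "12/31/15"])

# Unparseable entries are turned into NaT rather than raising
parsed = pd.to_datetime(raw, format="%m/%d/%y", errors="coerce")

# Count how many values were silently coerced to NaT
n_unparsed = parsed.isna().sum()
print(parsed)
print(n_unparsed)
```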

### Finding inconsistent categories

In [None]:
# Print unique values for categorical variables:
print('hazard_type: ', df['hazard_type'].unique(), "\n")
print('landslide_type: ', df['landslide_type'].unique(), "\n")
print('landslide_size: ', df['landslide_size'].unique(),"\n")
print('trigger: ', df['trigger'].unique(),"\n")

In [None]:
# drop columns with only one category
df = df.drop(columns=['hazard_type'])

In [None]:
# Lowercase the category labels and merge 'Continuous rain' into 'rain'
df['landslide_type'] = df['landslide_type'].str.lower()
df['landslide_size'] = df['landslide_size'].str.lower()
df['trigger'] = df['trigger'].replace({'Continuous rain':'rain'})
df['trigger'] = df['trigger'].str.lower()
# check results:
print('landslide_type: ', df['landslide_type'].unique(), "\n")
print('landslide_size: ', df['landslide_size'].unique(),"\n")
print('trigger: ', df['trigger'].unique(),"\n")
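
The same cleaning steps can be wrapped in a small helper so they are applied consistently to every categorical column. This is only a sketch: `normalize_labels` is a hypothetical name, the labels are made up, and the extra `str.strip()` is an assumption about possible stray whitespace.

```python
import pandas as pd

def normalize_labels(labels, mapping=None):
    """Strip whitespace, lowercase, then apply an optional synonym mapping."""
    cleaned = labels.str.strip().str.lower()
    if mapping:
        cleaned = cleaned.replace(mapping)
    return cleaned

# Invented labels showing case and whitespace inconsistencies
labels = pd.Series(["Landslide", "landslide ", "Mudslide", "Continuous rain", "Rain"])
cleaned = normalize_labels(labels, mapping={"continuous rain": "rain"})
print(sorted(cleaned.unique()))
```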

In [None]:
# Convert to categorical
df['country_code'] = df['country_code'].astype('category')
df['country_name'] = df['country_name'].astype('category')
df['landslide_type'] = df['landslide_type'].astype('category')
df['landslide_size'] = df['landslide_size'].astype('category')
df['trigger'] = df['trigger'].astype('category')

## Country and continent dataset

In [None]:
df_continent = pd.read_csv('../input/countrycodecontinent/countryContinent.csv',delimiter=';')
df_continent = df_continent.drop(columns=['Three_Letter_Country_Code','Country_Number','Continent_Code'])
df_continent.head(2)

In [None]:
df = pd.merge(df,df_continent,how='left',left_on='country_code',right_on='Two_Letter_Country_Code')
print('Continent_Name :', df['Continent_Name'].unique(), "\n")
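
A left merge keeps every event even when its country code has no match in the lookup table, in which case `Continent_Name` is missing. One way to spot such rows is the `indicator` option of `merge`, sketched here on made-up frames (the code `XX` is deliberately invalid).

```python
import pandas as pd

# Invented events and lookup table; "XX" has no entry in the lookup
events = pd.DataFrame({"id": [1, 2, 3], "country_code": ["US", "PH", "XX"]})
lookup = pd.DataFrame({
    "Two_Letter_Country_Code": ["US", "PH"],
    "Continent_Name": ["North America", "Asia"],
})

# indicator=True adds a "_merge" column telling where each row came from
merged = events.merge(
    lookup,
    how="left",
    left_on="country_code",
    right_on="Two_Letter_Country_Code",
    indicator=True,
)

# Country codes that found no continent
unmatched = merged.loc[merged["_merge"] == "left_only", "country_code"].tolist()
print(unmatched)
```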

<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='background:#7ca4cd; border:0' role="tab" aria-controls="home"><center>EDA</center></h3>

# Visualizing Categorical Variables
## Hazard location

In [None]:
fig, ax = plt.subplots(figsize=(12,10))

# Count events per country, most frequent first
sns.countplot(y="country_name", data=df,order = df['country_name'].value_counts().index,palette= 'Dark2',ax=ax)
ax.set_yticklabels(ax.get_yticklabels(),fontsize=15)
ax.set_ylabel('Country name',fontsize=20)
ax.set_xlabel('Count',fontsize=20)

plt.show()

## Hazard type, size and trigger

In [None]:
fig, ax = plt.subplots(3,1,figsize=(12,15))
fig.subplots_adjust(hspace=0.3)

# One count plot per variable: landslide type, trigger and size
sns.countplot(y="landslide_type", data=df,order = df['landslide_type'].value_counts().index,palette= 'Dark2',ax=ax[0])
ax[0].set_yticklabels(ax[0].get_yticklabels(),rotation=0,fontsize=15)
ax[0].set_title('Type of hazard',fontsize=16,y=1.0)

sns.countplot(y="trigger", data=df,order = df['trigger'].value_counts().index,palette= 'Dark2',ax=ax[1])
ax[1].set_yticklabels(ax[1].get_yticklabels(),rotation=0,fontsize=15)
ax[1].set_title('Trigger',fontsize=16,y=1.0)

sns.countplot(y="landslide_size", data=df,order = df['landslide_size'].value_counts().index,palette= 'Dark2',ax=ax[2])
ax[2].set_yticklabels(ax[2].get_yticklabels(),rotation=0,fontsize=15)
ax[2].set_title('Size of hazard',fontsize=16,y=1.0)

plt.show()

- The large majority of slope hazards are **landslides** and **mudslides**, followed by a few rockfalls, debris flows and complex hazards.
- These slope hazards are mostly triggered by sudden or continuous precipitation: **downpour** or **rain** events.
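
To put numbers on statements like these, the counts can be turned into proportions with `value_counts(normalize=True)`. A minimal sketch on invented labels, not the catalog itself:

```python
import pandas as pd

# Invented labels: 5 landslides, 3 mudslides, 2 rockfalls
toy = pd.Series(["landslide"] * 5 + ["mudslide"] * 3 + ["rockfall"] * 2)

# Share of each category, sorted from most to least frequent
shares = toy.value_counts(normalize=True)
print(shares)
```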

## Relation between trigger, type and size

In [None]:
## Count events for each combination of type and trigger
df_plot = df.groupby(['landslide_type', 'trigger']).size().reset_index().pivot(columns='trigger', index='landslide_type', values=0)
# plot
df_plot.plot(kind='bar', stacked=True,colormap='Set3', figsize=(14, 7))
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5),fontsize=16)
plt.title('landslide type versus trigger',fontsize=16)
plt.yticks(rotation=0,size=16) 
plt.xticks(rotation=90,size=16) 

## Count events for each combination of type and size
df_plot1 = df.groupby(['landslide_type', 'landslide_size']).size().reset_index().pivot(columns='landslide_size', index='landslide_type', values=0)
df_plot1.plot(kind='bar', stacked=True,colormap='Set3', figsize=(14, 7))
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5),fontsize=16)
plt.title('landslide size versus type',fontsize=16)
plt.yticks(rotation=0,size=16) 
plt.xticks(rotation=90,size=16) 

## Count events for each combination of trigger and size
df_plot = df.groupby(['trigger', 'landslide_size']).size().reset_index().pivot(columns='landslide_size', index='trigger', values=0)
df_plot.plot(kind='bar', stacked=True,colormap='Set3', figsize=(14, 7))
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5),fontsize=16)
plt.title('landslide size versus trigger',fontsize=16)
plt.yticks(rotation=0,size=16) 
plt.xticks(rotation=90,size=16) 

plt.show()



- There is no clear relation between trigger and hazard type, except that debris flows are triggered only by 'rain' and 'downpour'.
- Rockfalls are characterized by a high proportion of small events.
- Tropical cyclones mostly trigger medium-sized events.
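
Stacked bars of raw counts make proportions hard to compare between groups of different sizes. A row-normalized cross-tabulation is one way to read them directly, sketched here on made-up data with `pd.crosstab`:

```python
import pandas as pd

# Invented events: 3 rockfalls and 4 landslides of various sizes
toy = pd.DataFrame({
    "landslide_type": ["rockfall", "rockfall", "rockfall",
                       "landslide", "landslide", "landslide", "landslide"],
    "landslide_size": ["small", "small", "medium",
                       "medium", "medium", "small", "large"],
})

# normalize="index" makes each row (landslide type) sum to 1
shares = pd.crosstab(toy["landslide_type"], toy["landslide_size"], normalize="index")
print(shares.round(2))
```

Each row then gives the size distribution within one landslide type, independent of how many events that type has.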

## Relation between distance and hazard type

In [None]:
# Mean distance per landslide type, in a fixed display order
sns.catplot(x='landslide_type',y="distance",data=df.sort_values("distance"),kind='bar',palette='Dark2',
            order=['snow avalanche','debris flow','rockslide','rockfall','mudslide','landslide','complex','creep',
                  'other','riverbank collapse','lahar','unknown'])
plt.yticks(size=16) 
plt.xticks(rotation=90,size=16) 

sns.catplot(x='trigger',y="distance",data=df,kind='bar',palette='Dark2',order = ['freeze thaw','earthquake','volcano','downpour','flooding','unknown','rain','snowfall snowmelt', 'tropical cyclone' ,
         'mining digging','other','construction','dam embankment collapse'])
plt.yticks(size=16) 
plt.xticks(rotation=90,size=16) 
plt.show()

- The data suggest that hazards associated with large distances (e.g. snow avalanche, debris flow, rockslide, rockfall) occur in mountainous areas, where slopes are steep.
- Accordingly, freeze-thaw cycles, earthquakes and volcanic eruptions are the triggering events in these locations.
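
The bar heights in the plots above are group means, which can be dominated by a few extreme values. A quick numeric check is to compare mean, median and group size side by side. The sketch below uses invented distances, not the catalog values.

```python
import pandas as pd

# Invented distances for three landslide types
toy = pd.DataFrame({
    "landslide_type": ["snow avalanche", "snow avalanche",
                       "landslide", "landslide", "landslide",
                       "mudslide", "mudslide"],
    "distance": [20.0, 30.0, 2.0, 4.0, 6.0, 1.0, 3.0],
})

# Mean, median and number of events per type, largest mean first
summary = (
    toy.groupby("landslide_type")["distance"]
    .agg(["mean", "median", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

Groups with a small `count` deserve extra caution before drawing conclusions from their mean.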

## Relation between hazard and period of the year

In [None]:
df['month'] = pd.DatetimeIndex(df['date']).month
df['year'] = pd.DatetimeIndex(df['date']).year
df['count'] = 1

In [None]:
hazard_type_month = df.groupby(['year','month','landslide_type'])['count'].sum().fillna(0)  
hazard_type_month.unstack()

In [None]:
df_NA = df[df['Continent_Name']== 'North America']
df_NA.groupby(['month','landslide_type']).count()['count'].fillna(0).unstack()\
.plot(kind='bar', stacked=True,colormap='Set3', figsize=(14, 7))
plt.show()

df_SA = df[df['Continent_Name']== 'South America']
df_SA.groupby(['month','landslide_type']).count()['count'].fillna(0).unstack()\
.plot(kind='bar', stacked=True,colormap='Set3', figsize=(14, 7))
plt.show()



- In North America, the number of slope hazards per month follows a roughly bell-shaped distribution, with a maximum around July and a minimum around December-January. We observe a strong increase in 'debris flow' events during the summer (thunderstorms?).
- In South America, landslides and mudslides occur mostly during two periods: March to May and November-December. Rockfalls also seem to occur mostly from July to October.
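
When comparing seasonal patterns, months without any event are easy to miss because they simply do not appear in the grouped result. A minimal sketch, on made-up dates, of a month-by-type count table that always has all twelve months:

```python
import pandas as pd

# Invented events; most months have no event at all
toy = pd.DataFrame({
    "date": pd.to_datetime(["2010-07-04", "2011-07-20", "2012-01-15", "2013-08-01"]),
    "landslide_type": ["debris flow", "landslide", "landslide", "debris flow"],
})
toy["month"] = toy["date"].dt.month

# Count events per month and type, then force rows for months 1 to 12
monthly = (
    toy.groupby(["month", "landslide_type"])
    .size()
    .unstack(fill_value=0)
    .reindex(range(1, 13), fill_value=0)
)
print(monthly)
```

Plotting `monthly` with `kind='bar', stacked=True` then gives a continuous January-to-December axis, which makes the continents easier to compare.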