# Predicting Protein Classification - Pandas

Background:


The dataset covers many types of biologically significant macromolecules, most of which are proteins. DNA is transcribed into RNA, which is then translated into proteins, and proteins are the biomolecules that interact directly in biological pathways and cycles. A protein usually performs one or a few jobs, which are defined by its family type. For example, a protein from the hydrolase group catalyzes hydrolysis (breaking bonds by adding water), which helps break down protein chains or other molecules. Another example is a transport protein, which lets other molecules such as sucrose, fructose, or even water move into and out of the cell.

**Import Pandas and Numpy**


In [4]:
import pandas as pd
import numpy as np

**1. Import Datasets**

There are two data files. Both are keyed on the protein's "structureId":

pdb_data_no_dups.csv contains protein metadata, including details on protein classification, extraction methods, etc.

pdb_data_seq.csv contains >400,000 protein structure sequences.

In [21]:
df_seq = pd.read_csv(r"pdb_data_seq.csv")
df_char = pd.read_csv(r'pdb_data_no_dups.csv')

In [22]:
df_seq.head()

Unnamed: 0,structureId,chainId,sequence,residueCount,macromoleculeType
0,100D,A,CCGGCGCCGG,20,DNA/RNA Hybrid
1,100D,B,CCGGCGCCGG,20,DNA/RNA Hybrid
2,101D,A,CGCGAATTCGCG,24,DNA
3,101D,B,CGCGAATTCGCG,24,DNA
4,101M,A,MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...,154,Protein


In [23]:
df_char.head()

Unnamed: 0,structureId,classification,experimentalTechnique,macromoleculeType,residueCount,resolution,structureMolecularWeight,crystallizationMethod,crystallizationTempK,densityMatthews,densityPercentSol,pdbxDetails,phValue,publicationYear
0,100D,DNA-RNA HYBRID,X-RAY DIFFRACTION,DNA/RNA Hybrid,20,1.9,6360.3,"VAPOR DIFFUSION, HANGING DROP",,1.78,30.89,"pH 7.00, VAPOR DIFFUSION, HANGING DROP",7.0,1994.0
1,101D,DNA,X-RAY DIFFRACTION,DNA,24,2.25,7939.35,,,2.0,38.45,,,1995.0
2,101M,OXYGEN TRANSPORT,X-RAY DIFFRACTION,Protein,154,2.07,18112.8,,,3.09,60.2,"3.0 M AMMONIUM SULFATE, 20 MM TRIS, 1MM EDTA, ...",9.0,1999.0
3,102D,DNA,X-RAY DIFFRACTION,DNA,24,2.2,7637.17,"VAPOR DIFFUSION, SITTING DROP",277.0,2.28,46.06,"pH 7.00, VAPOR DIFFUSION, SITTING DROP, temper...",7.0,1995.0
4,102L,HYDROLASE(O-GLYCOSYL),X-RAY DIFFRACTION,Protein,165,1.74,18926.61,,,2.75,55.28,,,1993.0


### Variable Description

*structureId:* identity of the structure

*classification:* classification type

*experimentalTechnique:* technique of experiment

*macromoleculeType:* type of macromolecule

*residueCount:* number of residues

*resolution:* resolution of the structure, in ångströms (lower values mean finer detail)

*structureMolecularWeight:* molecular weight

*crystallizationMethod:* method of crystallization

*crystallizationTempK:* crystallization temperature in Kelvin

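*densityMatthews:* Matthews coefficient of the crystal (crystal volume per unit of molecular weight)

*densityPercentSol:* percentage of the crystal volume occupied by solvent

*pdbxDetails:* free-text details of the crystallization conditions

*phValue:* pH value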

*publicationYear:* published year


**2. Filter and Process Data**

In [24]:
# Filter for only proteins
protein_char = df_char[df_char.macromoleculeType == 'Protein']
protein_seq = df_seq[df_seq.macromoleculeType == 'Protein']

In [25]:
protein_char.head()

Unnamed: 0,structureId,classification,experimentalTechnique,macromoleculeType,residueCount,resolution,structureMolecularWeight,crystallizationMethod,crystallizationTempK,densityMatthews,densityPercentSol,pdbxDetails,phValue,publicationYear
2,101M,OXYGEN TRANSPORT,X-RAY DIFFRACTION,Protein,154,2.07,18112.8,,,3.09,60.2,"3.0 M AMMONIUM SULFATE, 20 MM TRIS, 1MM EDTA, ...",9.0,1999.0
4,102L,HYDROLASE(O-GLYCOSYL),X-RAY DIFFRACTION,Protein,165,1.74,18926.61,,,2.75,55.28,,,1993.0
5,102M,OXYGEN TRANSPORT,X-RAY DIFFRACTION,Protein,154,1.84,18010.64,,,3.09,60.2,"3.0 M AMMONIUM SULFATE, 20 MM TRIS, 1MM EDTA, ...",9.0,1999.0
7,103L,HYDROLASE(O-GLYCOSYL),X-RAY DIFFRACTION,Protein,167,1.9,19092.72,,,2.7,54.46,,,1993.0
8,103M,OXYGEN TRANSPORT,X-RAY DIFFRACTION,Protein,154,2.07,18093.78,,,3.09,60.3,"3.0 M AMMONIUM SULFATE, 20 MM TRIS, 1MM EDTA, ...",9.0,1999.0


In [26]:
protein_seq.head()

Unnamed: 0,structureId,chainId,sequence,residueCount,macromoleculeType
4,101M,A,MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...,154,Protein
7,102L,A,MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSE...,165,Protein
8,102M,A,MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...,154,Protein
11,103L,A,MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAK...,167,Protein
12,103M,A,MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...,154,Protein


In [27]:
protein_seq.describe()

Unnamed: 0,residueCount
count,345180.0
mean,4717.870508
std,26527.126728
min,3.0
25%,398.0
50%,856.0
75%,1976.0
max,313236.0


In [29]:
protein_char.columns

Index(['structureId', 'classification', 'experimentalTechnique',
       'macromoleculeType', 'residueCount', 'resolution',
       'structureMolecularWeight', 'crystallizationMethod',
       'crystallizationTempK', 'densityMatthews', 'densityPercentSol',
       'pdbxDetails', 'phValue', 'publicationYear'],
      dtype='object')

In [34]:
protein_char.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
residueCount,127796.0,721.054274,1741.804493,3.0,237.0,416.0,800.0,313236.0
resolution,117007.0,2.208683,1.339642,0.48,1.78,2.04,2.49,70.0
structureMolecularWeight,127796.0,89559.201609,469052.147475,453.55,26726.39,47060.0,91024.9,97730536.0
crystallizationTempK,88637.0,291.000652,8.764282,4.0,290.0,293.0,295.0,398.0
densityMatthews,114340.0,2.652625,0.687324,0.0,2.21,2.48,2.89,13.89
densityPercentSol,114361.0,51.113993,10.011243,0.0,44.16,50.2,57.32,92.0
phValue,95901.0,6.785967,1.272059,0.0,6.0,7.0,7.5,12.0
publicationYear,105929.0,2009.061758,6.461747,1969.0,2005.0,2010.0,2014.0,2018.0


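A pH value must lie between 0 and 14, but the raw phValue column contains a maximum of 100.

Likewise, publicationYear contains a value of 201, which is implausible. The summary table shown above already reflects the corrected ranges, which suggests it was captured after the cleanup cells below had been run.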

In [35]:
# No column label is given, so every column of the matching rows is set to NaN
protein_char.loc[protein_char.publicationYear == 201.0] = np.nan

In [36]:
# Sets every column of the rows where phValue exceeds 14 to NaN (no column label given)
protein_char.loc[protein_char.phValue > 14] = np.nan
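
The two assignments above pass only a row mask to `.loc`, so every column of a matching row becomes NaN, including `structureId` and `classification`. If the intent is to blank only the implausible value, one option is to name the column as well. The following is a minimal sketch on a made-up frame; `toy` and `cleaned` are illustrative names and the values are invented, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for protein_char; the values are made up for illustration.
toy = pd.DataFrame({
    "structureId": ["AAA1", "BBB2", "CCC3"],
    "phValue": [7.0, 100.0, 6.5],
    "publicationYear": [1999.0, 2005.0, 201.0],
})

# Work on an explicit copy so the assignment does not target a slice of another frame.
cleaned = toy.copy()

# Naming the column limits the NaN to the implausible cell only.
cleaned.loc[cleaned["phValue"] > 14, "phValue"] = np.nan
cleaned.loc[cleaned["publicationYear"] == 201.0, "publicationYear"] = np.nan

print(cleaned)
```

Applied to the real data, this variant would keep the affected rows, so the missing-value counts reported later in the notebook would likely differ.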

We will check that the dataframe has been updated:

In [37]:
protein_char.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
residueCount,127796.0,721.054274,1741.804493,3.0,237.0,416.0,800.0,313236.0
resolution,117007.0,2.208683,1.339642,0.48,1.78,2.04,2.49,70.0
structureMolecularWeight,127796.0,89559.201609,469052.147475,453.55,26726.39,47060.0,91024.9,97730536.0
crystallizationTempK,88637.0,291.000652,8.764282,4.0,290.0,293.0,295.0,398.0
densityMatthews,114340.0,2.652625,0.687324,0.0,2.21,2.48,2.89,13.89
densityPercentSol,114361.0,51.113993,10.011243,0.0,44.16,50.2,57.32,92.0
phValue,95901.0,6.785967,1.272059,0.0,6.0,7.0,7.5,12.0
publicationYear,105929.0,2009.061758,6.461747,1969.0,2005.0,2010.0,2014.0,2018.0


In [33]:
# Select some variables to join
protein_char = protein_char[['structureId','classification','experimentalTechnique','residueCount', 'resolution',
       'structureMolecularWeight','crystallizationTempK', 'densityMatthews', 'densityPercentSol', 'phValue', 'publicationYear']]
protein_char.head()

Unnamed: 0,structureId,classification,experimentalTechnique,residueCount,resolution,structureMolecularWeight,crystallizationTempK,densityMatthews,densityPercentSol,phValue,publicationYear
2,101M,OXYGEN TRANSPORT,X-RAY DIFFRACTION,154.0,2.07,18112.8,,3.09,60.2,9.0,1999.0
4,102L,HYDROLASE(O-GLYCOSYL),X-RAY DIFFRACTION,165.0,1.74,18926.61,,2.75,55.28,,1993.0
5,102M,OXYGEN TRANSPORT,X-RAY DIFFRACTION,154.0,1.84,18010.64,,3.09,60.2,9.0,1999.0
7,103L,HYDROLASE(O-GLYCOSYL),X-RAY DIFFRACTION,167.0,1.9,19092.72,,2.7,54.46,,1993.0
8,103M,OXYGEN TRANSPORT,X-RAY DIFFRACTION,154.0,2.07,18093.78,,3.09,60.3,9.0,1999.0


In [38]:
protein_seq = protein_seq[['structureId','sequence']]
protein_seq.head()

Unnamed: 0,structureId,sequence
4,101M,MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...
7,102L,MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSE...
8,102M,MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...
11,103L,MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAK...
12,103M,MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...


*We can now perform a join using structureId as the index. We'll utilize pandas 'join' method. To do this, we have to set the index for each dataframe to be 'structureId'.*

**Join columns of another DataFrame:**

**df.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)**

Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list.

One option of using the key columns is to use the on parameter. DataFrame.join always uses other’s index but we can use any column in df. This method preserves the original DataFrame’s index in the result.
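
To make the two calling styles concrete, here is a small sketch on hypothetical miniature tables (`meta` and `chains` are made-up stand-ins for the metadata and sequence frames). It also shows that when the right-hand table has several rows per key, the left row is repeated once per match.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
meta = pd.DataFrame({
    "structureId": ["101M", "102L"],
    "classification": ["OXYGEN TRANSPORT", "HYDROLASE(O-GLYCOSYL)"],
})
chains = pd.DataFrame({
    "structureId": ["101M", "102L", "102L"],
    "sequence": ["MVLSEG", "MNIFEM", "MNIFEM"],
})

# join always matches against the index of the right-hand frame.
right = chains.set_index("structureId")

# Option 1: index both frames on the key, then join index-to-index.
by_index = meta.set_index("structureId").join(right)

# Option 2: keep meta's own index and match its key column via `on`.
by_column = meta.join(right, on="structureId")

print(by_index)
print(by_column)
```

Both results have three rows because "102L" has two chains. The same effect is one reason a joined table can have more rows than the metadata table it started from.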

In [35]:
# Join two datasets on structureId
model_f = protein_char.set_index('structureId').join(protein_seq.set_index('structureId'))
model_f.head()

structureId,classification,experimentalTechnique,residueCount,resolution,structureMolecularWeight,crystallizationTempK,densityMatthews,densityPercentSol,phValue,publicationYear,sequence
101M,OXYGEN TRANSPORT,X-RAY DIFFRACTION,154.0,2.07,18112.8,,3.09,60.2,9.0,1999.0,MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...
102L,HYDROLASE(O-GLYCOSYL),X-RAY DIFFRACTION,165.0,1.74,18926.61,,2.75,55.28,,1993.0,MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSE...
102M,OXYGEN TRANSPORT,X-RAY DIFFRACTION,154.0,1.84,18010.64,,3.09,60.2,9.0,1999.0,MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...
103L,HYDROLASE(O-GLYCOSYL),X-RAY DIFFRACTION,167.0,1.9,19092.72,,2.7,54.46,,1993.0,MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNSLDAAK...
103M,OXYGEN TRANSPORT,X-RAY DIFFRACTION,154.0,2.07,18093.78,,3.09,60.3,9.0,1999.0,MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...


In [36]:
print('%d is the number of rows in the joined dataset' %model_f.shape[0])

346317 is the number of rows in the joined dataset


The two dataframes have now been joined into one with 346,317 rows. The data processing is not finished, as it is important to look at the missingness in each column.

In [37]:
# Check NA counts
model_f.isnull().sum()

classification                   3
experimentalTechnique            2
residueCount                     2
resolution                   16093
structureMolecularWeight         2
crystallizationTempK        102391
densityMatthews              38844
densityPercentSol            38710
phValue                      87195
publicationYear              50964
sequence                         5
dtype: int64

With 346,317 rows, it appears that simply removing rows with missing values is acceptable.
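
Before dropping rows, it can help to see the missing share per column and how many rows a full `dropna()` would keep. The sketch below uses a made-up frame (`toy` is an illustrative stand-in for `model_f`, with invented values).

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for model_f.
toy = pd.DataFrame({
    "classification": ["HYDROLASE", "LYASE", "TOXIN", "LYASE"],
    "crystallizationTempK": [277.0, np.nan, np.nan, 293.0],
    "phValue": [7.0, 6.5, np.nan, 8.0],
})

# Share of missing entries per column, as a percentage.
missing_pct = toy.isnull().mean() * 100

# Rows that survive dropna(), i.e. rows with no missing value in any column.
kept = len(toy.dropna())

print(missing_pct.round(1))
print(f"{kept} of {len(toy)} rows kept")
```

Because `dropna()` removes a row if any column is missing, the loss can be larger than any single column's count suggests. In this notebook it reduces 346,317 rows to 176,001.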

**3. Drop rows with missing values**

In [38]:
model_f = model_f.dropna()
print('%d is the number of proteins that have a classification and sequence' %model_f.shape[0])

176001 is the number of proteins that have a classification and sequence


**4. Types of family groups:**

In [39]:
# Look at classification type counts
counts = model_f.classification.value_counts()
counts.head(10)

HYDROLASE                        23716
TRANSFERASE                      18709
OXIDOREDUCTASE                   17718
IMMUNE SYSTEM                    10031
HYDROLASE/HYDROLASE INHIBITOR     9166
LYASE                             6368
TRANSCRIPTION                     4759
TRANSPORT PROTEIN                 4190
VIRAL PROTEIN                     4120
ISOMERASE                         3253
Name: classification, dtype: int64

*Display the ten least frequent classification types (the tail of the counts Series):*

In [40]:
counts.tail(10)

TRANSPORT PROTEIN/ANTAGONIST            1
Transferase, Signaling protein          1
HYDROLASE, ANTITUMOR PROTEIN            1
Ligase, Transferase                     1
ANTIVIRAL PROTEIN, HYDROLASE            1
PROTEIN TRANSPORT, TOXIN                1
DYE-BINDING PROTEIN                     1
CHAPERONE REGULATOR                     1
IMMUNE SYSTEM, LIPID BINDING PROTEIN    1
lipid transport/activator               1
Name: classification, dtype: int64

**5. Counting the different classification types in the dataframe:**

The counts per family type vary widely. We can filter for family types that have more than a certain number of records:

In [41]:
# Get classification types where counts are over 1000
types = np.asarray(counts[(counts>1000)].index)

In [42]:
types

array(['HYDROLASE', 'TRANSFERASE', 'OXIDOREDUCTASE', 'IMMUNE SYSTEM',
       'HYDROLASE/HYDROLASE INHIBITOR', 'LYASE', 'TRANSCRIPTION',
       'TRANSPORT PROTEIN', 'VIRAL PROTEIN', 'ISOMERASE',
       'SIGNALING PROTEIN', 'LIGASE', 'PROTEIN BINDING',
       'TRANSFERASE/TRANSFERASE INHIBITOR', 'MEMBRANE PROTEIN',
       'SUGAR BINDING PROTEIN', 'STRUCTURAL PROTEIN',
       'DNA BINDING PROTEIN', 'CHAPERONE', 'METAL BINDING PROTEIN',
       'CELL ADHESION', 'ELECTRON TRANSPORT', 'PROTEIN TRANSPORT',
       'UNKNOWN FUNCTION', 'TOXIN', 'PHOTOSYNTHESIS', 'CELL CYCLE',
       'OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR', 'GENE REGULATION',
       'RNA BINDING PROTEIN'], dtype=object)

*Filter dataset's records and drop_duplicates*

In [43]:
# Filter dataset's records for classification types > 1000
filter_df = model_f[model_f.classification.isin(types)]


In [44]:
filter_df

structureId,classification,experimentalTechnique,residueCount,resolution,structureMolecularWeight,crystallizationTempK,densityMatthews,densityPercentSol,phValue,publicationYear,sequence
1A4S,OXIDOREDUCTASE,X-RAY DIFFRACTION,2012.0,2.10,217689.59,287.0,2.43,41.00,7.5,1998.0,AQLVDSMPSASTGSVVVTDDLNYWGGRRIKSKDGATTEPVFEPATG...
1A4S,OXIDOREDUCTASE,X-RAY DIFFRACTION,2012.0,2.10,217689.59,287.0,2.43,41.00,7.5,1998.0,AQLVDSMPSASTGSVVVTDDLNYWGGRRIKSKDGATTEPVFEPATG...
1A4S,OXIDOREDUCTASE,X-RAY DIFFRACTION,2012.0,2.10,217689.59,287.0,2.43,41.00,7.5,1998.0,AQLVDSMPSASTGSVVVTDDLNYWGGRRIKSKDGATTEPVFEPATG...
1A4S,OXIDOREDUCTASE,X-RAY DIFFRACTION,2012.0,2.10,217689.59,287.0,2.43,41.00,7.5,1998.0,AQLVDSMPSASTGSVVVTDDLNYWGGRRIKSKDGATTEPVFEPATG...
1A6Q,HYDROLASE,X-RAY DIFFRACTION,382.0,2.00,42707.55,277.0,2.97,59.00,5.0,1996.0,MGAFLDKPKMEKHNAQGQGNGLRYGLSSMQGWRVEMEDAHTAVIGL...
...,...,...,...,...,...,...,...,...,...,...,...
6F6P,HYDROLASE,X-RAY DIFFRACTION,424.0,2.45,47994.95,291.0,2.61,56.00,7.3,2018.0,GASSRLRSPSVLEVREKGYERLKEELAKAQRELKLKDEECERLSKV...
6F6P,HYDROLASE,X-RAY DIFFRACTION,424.0,2.45,47994.95,291.0,2.61,56.00,7.3,2018.0,GAASRLRSPSVLEVREKGYERLKEELAKAQRELKLKDEECERLSKV...
6F6S,VIRAL PROTEIN,X-RAY DIFFRACTION,497.0,2.29,58337.03,293.0,3.83,67.89,5.2,2018.0,ETGRSIPLGVIHNSALQVSDVDKLVCRDKLSSTNQLRSVGLNLEGN...
6F6S,VIRAL PROTEIN,X-RAY DIFFRACTION,497.0,2.29,58337.03,293.0,3.83,67.89,5.2,2018.0,EAIVNAQPKCNPNLHYWTTQDEGAAIGLAWIPYFGPAAEGIYIEGL...


In [45]:
filter_df1 = filter_df.drop_duplicates(subset=["classification","sequence"])
filter_df1 

structureId,classification,experimentalTechnique,residueCount,resolution,structureMolecularWeight,crystallizationTempK,densityMatthews,densityPercentSol,phValue,publicationYear,sequence
1A4S,OXIDOREDUCTASE,X-RAY DIFFRACTION,2012.0,2.10,217689.59,287.0,2.43,41.00,7.5,1998.0,AQLVDSMPSASTGSVVVTDDLNYWGGRRIKSKDGATTEPVFEPATG...
1A6Q,HYDROLASE,X-RAY DIFFRACTION,382.0,2.00,42707.55,277.0,2.97,59.00,5.0,1996.0,MGAFLDKPKMEKHNAQGQGNGLRYGLSSMQGWRVEMEDAHTAVIGL...
1A72,OXIDOREDUCTASE,X-RAY DIFFRACTION,374.0,2.60,40658.50,277.0,2.30,46.82,8.4,1998.0,STAGKVIKCKAAVLWEEKKPFSIEEVEVAPPKAHEVRIKMVATGIC...
1A8O,VIRAL PROTEIN,X-RAY DIFFRACTION,70.0,1.70,8175.72,277.0,2.21,43.80,8.0,1997.0,MDIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNWMTETLLVQNANP...
1ACC,TOXIN,X-RAY DIFFRACTION,735.0,2.10,82849.97,277.0,2.30,47.00,6.0,1997.0,EVKQENRLLNESESSSQGLLGYYFSDLNFQAPMVVTSSTTGDLSIP...
...,...,...,...,...,...,...,...,...,...,...,...
6F5U,VIRAL PROTEIN,X-RAY DIFFRACTION,498.0,2.07,57299.72,293.0,3.48,64.67,5.2,2018.0,EAIVNAQPKCNPNLHYWTTQDEGAAIGLAWIPYFGPAAEGIYIEGL...
6F6P,HYDROLASE,X-RAY DIFFRACTION,424.0,2.45,47994.95,291.0,2.61,56.00,7.3,2018.0,GAASRLRSPSVLEVREKGYERLKEELAKAQRELKLKDEECERLSKV...
6F6P,HYDROLASE,X-RAY DIFFRACTION,424.0,2.45,47994.95,291.0,2.61,56.00,7.3,2018.0,GASSRLRSPSVLEVREKGYERLKEELAKAQRELKLKDEECERLSKV...
6F6S,VIRAL PROTEIN,X-RAY DIFFRACTION,497.0,2.29,58337.03,293.0,3.83,67.89,5.2,2018.0,ETGRSIPLGVIHNSALQVSDVDKLVCRDKLSSTNQLRSVGLNLEGN...


In [46]:
filter_df1.shape

(36146, 11)
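
The `subset` argument means two rows count as duplicates only when both `classification` and `sequence` match, and the first occurrence is kept by default. A small sketch on a made-up frame (shortened, illustrative sequences) shows why a structure such as 6F6P can still appear twice after deduplication:

```python
import pandas as pd

# Illustrative rows: two identical chains and two chains that differ slightly.
toy = pd.DataFrame(
    {
        "classification": ["OXIDOREDUCTASE", "OXIDOREDUCTASE", "HYDROLASE", "HYDROLASE"],
        "sequence": ["AQLVDS", "AQLVDS", "GASSRL", "GAASRL"],
    },
    index=pd.Index(["1A4S", "1A4S", "6F6P", "6F6P"], name="structureId"),
)

# Rows are duplicates only if BOTH listed columns match; the first one is kept.
deduped = toy.drop_duplicates(subset=["classification", "sequence"])
print(deduped)
```

The index (`structureId`) plays no part in the comparison, so chains of the same structure with different sequences are all retained.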

In [47]:
print(types)
print('%d is the number of records in the final filtered dataset' %filter_df1.shape[0])

['HYDROLASE' 'TRANSFERASE' 'OXIDOREDUCTASE' 'IMMUNE SYSTEM'
 'HYDROLASE/HYDROLASE INHIBITOR' 'LYASE' 'TRANSCRIPTION'
 'TRANSPORT PROTEIN' 'VIRAL PROTEIN' 'ISOMERASE' 'SIGNALING PROTEIN'
 'LIGASE' 'PROTEIN BINDING' 'TRANSFERASE/TRANSFERASE INHIBITOR'
 'MEMBRANE PROTEIN' 'SUGAR BINDING PROTEIN' 'STRUCTURAL PROTEIN'
 'DNA BINDING PROTEIN' 'CHAPERONE' 'METAL BINDING PROTEIN' 'CELL ADHESION'
 'ELECTRON TRANSPORT' 'PROTEIN TRANSPORT' 'UNKNOWN FUNCTION' 'TOXIN'
 'PHOTOSYNTHESIS' 'CELL CYCLE' 'OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR'
 'GENE REGULATION' 'RNA BINDING PROTEIN']
36146 is the number of records in the final filtered dataset


## More exercises

In [48]:
df_exe=filter_df1.copy()

In [49]:
df_exe.info()

<class 'pandas.core.frame.DataFrame'>
Index: 36146 entries, 1A4S to 6F8P
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   classification            36146 non-null  object 
 1   experimentalTechnique     36146 non-null  object 
 2   residueCount              36146 non-null  float64
 3   resolution                36146 non-null  float64
 4   structureMolecularWeight  36146 non-null  float64
 5   crystallizationTempK      36146 non-null  float64
 6   densityMatthews           36146 non-null  float64
 7   densityPercentSol         36146 non-null  float64
 8   phValue                   36146 non-null  float64
 9   publicationYear           36146 non-null  float64
 10  sequence                  36146 non-null  object 
dtypes: float64(8), object(3)
memory usage: 3.3+ MB


**6. Find the ten most recently published proteins**
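
One possible approach, sketched on a made-up frame rather than `df_exe`: sort by `publicationYear` in descending order and take the first rows. The IDs "3ABC" and "2XYZ" and the variable `n` are hypothetical; on the full data `n` would be 10.

```python
import pandas as pd

# Made-up frame indexed by structureId, mimicking the layout of df_exe.
toy = pd.DataFrame(
    {"publicationYear": [1998.0, 2018.0, 2011.0, 2018.0, 2005.0]},
    index=pd.Index(["1A4S", "6F6P", "3ABC", "6F6S", "2XYZ"], name="structureId"),
)

n = 3  # would be 10 on the full data

# Newest first, then keep the top n rows.
most_recent = toy.sort_values("publicationYear", ascending=False).head(n)
print(most_recent)
```

Many records share the same year, so which of the tied rows make the cut depends on the sort. `DataFrame.nlargest(n, "publicationYear")` is an equivalent alternative.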

**7. Write a program that finds the percentage distribution of pH values across the acidic, basic, and neutral ranges.**

**Acidic:**

**Base:**

**Neutral:**

In [50]:
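# Label each pH as Acidic (< 7), Base (> 7) or Neutral (= 7), then convert counts to percentages
ph_class = np.select([df_exe.phValue < 7, df_exe.phValue > 7], ['Acidic', 'Base'], default='Neutral')
perc = pd.Series(ph_class).value_counts(normalize=True) * 100
perc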

**8. Use groupby to find, for each crystallization temperature, which proteins crystallize at that temperature.**
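
A possible sketch, assuming "which proteins" means the structure IDs (collecting `classification` instead would work the same way). `toy` is an illustrative frame with `structureId` as its index, as in `df_exe`.

```python
import pandas as pd

# Small illustrative frame indexed by structureId.
toy = pd.DataFrame(
    {
        "classification": ["HYDROLASE", "OXIDOREDUCTASE", "VIRAL PROTEIN", "TOXIN"],
        "crystallizationTempK": [277.0, 277.0, 293.0, 277.0],
    },
    index=pd.Index(["1A6Q", "1A72", "6F6S", "1ACC"], name="structureId"),
)

# Move structureId out of the index, group by temperature,
# and collect the IDs of each group into a list.
by_temp = (
    toy.reset_index()
       .groupby("crystallizationTempK")["structureId"]
       .agg(list)
)
print(by_temp)
```

The result is a Series indexed by temperature whose values are lists of structure IDs.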

**9. Create a new dataframe which contains only the maximum value of the column 'structureMolecularWeight' for each classification:**
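
One way to approach this, shown on a made-up frame (`toy` and `max_weight` are illustrative names): group by `classification` and take the maximum of the weight column, with `as_index=False` so the result is a regular dataframe.

```python
import pandas as pd

# Illustrative rows; two records per classification.
toy = pd.DataFrame({
    "classification": ["OXIDOREDUCTASE", "OXIDOREDUCTASE", "HYDROLASE", "HYDROLASE"],
    "structureMolecularWeight": [217689.59, 40658.50, 42707.55, 47994.95],
})

# as_index=False keeps classification as a column instead of the index.
max_weight = (
    toy.groupby("classification", as_index=False)["structureMolecularWeight"].max()
)
print(max_weight)
```

The groups come back sorted by classification name, with one row per classification.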